Machine Learning For Hydrologic Sciences: An Introductory Overview
4 Abstract
5 In recent years the hydrologic community has witnessed a surge of interest in machine
6 learning, driven by its enormous success in various academic and commercial applications,
7 rapidly growing hydrologic data repository, and increasing accessibility to machine learning-
8 enabling hardware and software. This overview is intended for readers who are new to the
9 field of machine learning. It provides a non-technical introduction to fundamental concepts
10 underpinning this field while explaining commonly used algorithms and deep learning
11 architectures placed in a historical context, as well as common practices in applying these
12 techniques. Machine learning applications in hydrologic sciences are summarized next, with
a focus on recent studies. To date, machine learning algorithms have been used to detect patterns and events such as extreme weather and land use change, to approximate hydrologic variables and processes such as rainfall-runoff relationships, and to mine relationships among variables in order to identify controlling factors. Machine learning is also being applied as a complement to process-based modeling for parameterization, development of fast surrogates, and bias correction. Finally, the overview highlights challenges related to the interpretability,
19 physical consistency, extrapolation robustness, limited sample size, and uncertainty
20 quantification of machine learning models and predictions, as well as the research frontier in
21 integrating physical knowledge with machine learning.
22
23 Graphical/Visual Abstract and Caption
24
25 Caption: Machine learning has been used in various hydrologic applications in stand-alone
26 mode or within the workflow of process-based modeling. Arrows indicate information flow.
27
28 1. INTRODUCTION
29 Machine learning is the set of methods and algorithms that enable computers to
automatically improve performance through experience. As such, they embody “data-driven” reasoning, as opposed to the “knowledge-driven” reasoning that underpins most physical
32 science disciplines. Since the pioneering research that was conducted in the 1950s (Turing,
33 1950; Rosenblatt, 1958), the field of machine learning has seen dramatic progress. In the
34 1980s, backpropagation (Rumelhart et al., 1986) was found to be effective in training
35 artificial neural networks (ANNs), which led to a surge in machine learning research centered
36 around ANNs and their widespread applications in various disciplines, including hydrology
37 (Buch et al., 1993; Kang et al., 1993; Hsu et al., 1995; Smith and Eli, 1995). Later, support
38 vector machines (SVM, Vapnik, 1995) and other kernel methods (Liang et al., 2007;
Hofmann et al., 2008) were developed and became popular. In recent years, machine
40 learning has become an interdisciplinary area intersecting with computer science, statistics,
41 applied mathematics, and optimization.
42
43 Successful applications of conventional machine learning algorithms typically require
44 a set of customized input features that best represent the raw data for the subsequent learning
45 tasks. Deep learning, a class of machine learning algorithms based on ANNs of multiple
46 layers (thus deep), is capable of automatically discovering appropriate representations from
47 raw data (LeCun et al., 2015). While some deep learning architectures such as Recurrent
48 Neural Network (RNN) were invented by the 1990s, widespread interest in deep learning
49 research and applications flourished in the 2010s when low-cost computation and massive
50 online data became increasingly available. Recent advances in machine learning, primarily in
51 the field of deep learning, have brought breakthroughs in computer vision, speech
52 recognition, and natural language processing and have achieved enormous successes in both
53 scientific and commercial applications.
54
55 Inspired by the enormous success reported in the deep learning community and
56 industry, researchers from various scientific disciplines are eager to apply machine learning
57 techniques to problems from their own fields (Ching et al., 2018; Khan and Yairi, 2018;
58 Radovic et al., 2018; Mater and Coote, 2019; Reichstein et al., 2019; Sengupta et al., 2020).
59 In the hydrologic sciences community, a growing interest in machine learning is largely
driven by the availability of vast hydrologic data repositories (Shen, 2018; Shen et al., 2018). Advances in sensor technology, the promotion of hydrologic observatories, and developments in cyberinfrastructure that enable easy sharing of data have all ushered in an era of data deluge
63 in the form of a plethora of in situ sensor measurements as well as remote sensing imagery.
64 Existing knowledge about the hydrological processes is, therefore, no longer adequate to
65 represent the full range of variability observed in data (Hipsey et al., 2015; Kumar, 2015). In
66 addition, due to the unprecedented volume and complexity of data, the knowledge-driven
67 reasoning alone is not adequate to get the most out of available data. Machine learning, as
68 well as the data-driven reasoning it enables, thus provides exciting opportunities for both the
69 recovery of a full range of variability (thus bringing potentially improved prediction
70 capability) as well as our capacity to discover new knowledge.
71
72 This paper aims to give a broad and non-technical overview of machine learning and
73 its recent applications in hydrologic sciences. We begin this overview by introducing
74 fundamental concepts and terminology. We then briefly describe several popular non-deep
75 machine learning algorithms and deep learning architectures along with common practices of
76 applying these methods. Next, we explore existing research, with a focus on recent studies
77 that apply machine learning in hydrologic sciences. Finally, we conclude with challenges
78 associated with applying machine learning for hydrologic problems and accompanying
79 research opportunities.
80
81 2. MACHINE LEARNING BASICS
space. Parametric machine learning algorithms make explicit assumptions about the form of the function, such as a linear or polynomial function of the input. In contrast, nonparametric alternatives tend to make fewer assumptions about the form of the function. For
100 quick reference, Table 1 summarizes the above and other key terminology that will be
101 discussed in this section.
102
103
104
Table 1. Definition of terms

Artificial intelligence (AI): The study of intelligence demonstrated by a machine, manifested by its capability to perceive the environment and take actions to achieve its goals and tasks through flexible adaptation (Kaplan and Haenlein, 2019).

Classification: A subtype of supervised learning where the targets are categories or labels.

Deep learning: A class of machine learning algorithms based on artificial neural networks (ANNs) and using hierarchical architectures to extract higher level features from input data via representation learning.

Feature engineering: The process of creating features from raw data that may be useful for the subsequent learning task; typically implemented manually with domain expertise.

Generalization error/test error: Measures the prediction capability of a trained machine learning model on independent test data unseen during training.

Hyperparameters/tuning parameters: Settings that can be tweaked to change the structure (e.g., number of layers in an ANN) and behavior (e.g., smoothness preference) of the learning algorithm.

Machine learning: A subset of AI (Fig. 1); learning methods and algorithms that enable computers to automatically improve performance through experience.

Overfitting: Overfitting occurs when a machine learning model has a high degree of freedom that cannot be fully justified by the training data. The opposite, underfitting, occurs when a model is too simple and thus inflexible in representing the range of variability of the training data.

Regression: A subtype of supervised learning where the targets are real numbers.

Regularization: A technique intended to reduce the generalization error, often by modifying the loss function to penalize deviation from certain preferences (e.g., smoothness).

Representation learning: Techniques that automatically discover representations (or features) that are useful for subsequent learning tasks. Also known as feature learning.

Supervised learning: The computer is given examples consisting of inputs and their desired targets; the computer is trained on these examples to learn the input-to-target relationship.

Unsupervised learning: The computer is given inputs but no target variables; the goal is to find underlying patterns in the input data.
106
107
109 Figure 1. The nested concepts of artificial intelligence, machine learning, representation
110 learning, and deep learning. Definitions of the four terms are listed in Table 1.
111
112
113 In the context of hydrology, unsupervised learning techniques can be used, for example, to
114 cluster catchments into groups with distinct hydrologic regimes. Distinguishing different land
115 cover types from multi-spectral satellite images can be formulated as a classification
116 problem, where a classifier needs to learn the mapping from the bands and derived indices
(inputs) to land cover classes (labels). Streamflow forecasting can be formulated as a regression problem that learns a functional relationship between streamflow at some lead time (target) and inputs such as past and forecasted meteorological conditions and past streamflow data, given historical examples of the inputs and corresponding target and, for example, minimizing the mean squared error (performance metric). These problems can be approached using various machine learning algorithms that differ in the choices of hypothesis space, loss/objective function, and optimization method. Below we provide brief, intuitive descriptions (along
124 with references) of several classical machine learning and deep learning algorithms that have
125 been applied in hydrologic sciences. Readers are also referred to Shen et al. (2018) for a
transdisciplinary review of deep learning and Tahmasebi et al. (2020) for a review of machine learning algorithms commonly used in geosciences, with a focus on porous media
128 problems. Readers who are interested in a more comprehensive, in-depth discussion of
129 machine learning theory and algorithms may refer to Mitchell (1997), Hastie et al. (2009),
and Goodfellow et al. (2016). In addition, Géron (2019) provides a hands-on guide to machine
131 learning and deep learning with working code.
132
133 2.1. Conventional machine learning algorithms
139 distance (Irani et al., 2016). A popular clustering algorithm is K-means, which takes a
140 random initialization of the cluster assignment, and then iteratively minimizes the within-
141 cluster point scatter measured by, for example, Euclidean distance in the input space, until
convergence (MacQueen, 1967; Hartigan and Wong, 1979). The within-cluster point scatter is defined as the sum of the distances between every pair of data points assigned to the same
144 cluster.
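As a concrete illustration, the minimal sketch below clusters catchments with scikit-learn's K-means implementation. The catchment attributes are synthetic placeholders, and the number of clusters is an illustrative hyperparameter choice, not a recommendation from this overview.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical catchment attributes (rows = catchments, columns = e.g.,
# mean annual precipitation, aridity index, mean slope).
rng = np.random.default_rng(0)
attributes = rng.normal(size=(300, 3))

# Standardize so that the Euclidean distance is not dominated by one variable.
X = StandardScaler().fit_transform(attributes)

# Partition catchments into 4 clusters; n_clusters is a hyperparameter.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])   # cluster assignment of the first 10 catchments
print(kmeans.inertia_)       # within-cluster sum of squared distances to the centroids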
145
146 Over the past few decades, variants of K-means and other algorithms such as
147 agglomerative hierarchical clustering and fuzzy clustering have been proposed and used in
148 various applications (de Oliveira and Pedrycz, 2007; Jain, 2010; Murtagh and Legendre,
149 2014; Tennant et al., 2021). Although clustering is an unsupervised learning technique, it is
150 sometimes used to learn data representation in the pre-processing step for a supervised
151 learning task. For example, the cluster assignment can be used to produce new features on top
152 of the raw input variables (Coates et al., 2011).
153
154 2.1.2. Lasso
155 Least Absolute Shrinkage and Selection Operator (Lasso) is a widely used regression
method that adds an L1 penalty term (the sum of the absolute values of the linear regression coefficients b_i, i = 1, ..., p; Table 2) to the ordinary least squares loss function in order to keep the regression coefficients small (Tibshirani, 1996). Because of the L1 regularization,
159 Lasso typically sets some of the regression coefficients to zero. The number of zero
160 coefficients depends on the penalty hyperparameter, which is usually determined through
161 cross validation. As such, the algorithm performs both feature selection and parameter
162 estimation simultaneously, and has been widely used for high dimensional regression
163 problems. In addition, Lasso can be used for classification when combined with logistic
164 regression (Hosmer et al., 2013). Due to its good generalization performance, sparsity and
165 interpretability, Lasso has been used in various applications (e.g., Anda et al., 2018; Vandal
166 et al., 2017).
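The following minimal sketch (scikit-learn, synthetic data with only three informative predictors; all settings are illustrative) shows how Lasso selects the penalty strength by cross validation and zeroes out irrelevant coefficients.

import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic example: 100 samples, 20 candidate predictors, only 3 of which matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.5 * X[:, 7] + 0.1 * rng.normal(size=100)

# LassoCV selects the L1 penalty strength by 5-fold cross validation.
model = LassoCV(cv=5).fit(X, y)

print("selected penalty (alpha):", model.alpha_)
print("number of non-zero coefficients:", int(np.sum(model.coef_ != 0)))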
167
168 Table 2. Comparison of the representation of input variables by five supervised
169 machine learning algorithms (Lasso, SVM, GPR, CART, and ANN).
Algorithm: Representation
Lasso: x = [x_1, ..., x_p]^T, the original inputs
SVM & GPR: φ(x), inputs projected to a higher dimensional feature space
CART: 1{x ∈ R_i}, indicator function that equals 1 if x is in the leaf R_i and 0 otherwise
ANN: f_d(... f_2(f_1(x))), output of the last hidden layer
170
171
172 2.1.3. Support vector machine (SVM)
173 Support vector machine (SVM) is believed to be among the most robust prediction
174 methods because it seeks to minimize an upper bound of the generalization error rather than
175 the training error (Vapnik, 1998). In addition, the solution is globally optimal under
176 conditions that can often be met, while other machine learning algorithms such as ANN may
177 converge to local minima. The SVM algorithm maps the input variables to a higher
178 dimensional feature space, 𝜙(𝐱) (Table 2). The map is usually implemented implicitly via a
179 kernel function, also known as the kernel trick. The kernel function is analogous to the
180 covariance function in the context of Gaussian process (Section 2.1.4). For classification
181 tasks, SVM identifies the optimal separating hyperplanes in the feature space while
maximizing the margin between classes. The kernel trick enables SVM to classify data points that are not linearly separable in the original input space. For regression tasks, SVM minimizes an objective function that combines a loss term, which penalizes only errors larger than a specified threshold, with an L2 regularization term. Ideally, the choice of kernel function should be based on the structure of the input
186 data and their relation to the output. Lastly, it is worth noting that the model produced by
187 SVM is represented sparsely as the linear combination of a subset of the training data
188 (“support vectors”) projected into the feature space.
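A minimal sketch of support vector regression with a radial basis function kernel is given below (scikit-learn, synthetic data; the regularization constant C and threshold epsilon are illustrative values).

import numpy as np
from sklearn.svm import SVR

# Synthetic 1-D regression problem with a nonlinear target.
rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=200)

# The RBF kernel implicitly maps inputs to a higher dimensional feature space;
# epsilon is the error threshold below which no loss is incurred.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.05).fit(X, y)

print("number of support vectors:", len(svr.support_))
print("prediction at x = 1.0:", svr.predict([[1.0]]))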
189
190 2.1.4. Gaussian process regression
191 Gaussian process regression (GPR) is a Bayesian kernel regression method and has
192 been shown to perform well in a variety of benchmark applications. A GP refers to a set of
193 random variables, indexed in space and time, that have a joint multivariate Gaussian
194 distribution. A GP is fully specified by a mean function and a covariance function that
195 describes the covariance between each pair of the random variables (i.e., the quantity of
196 interest at two separate locations/times). The two functions should reflect the prior
197 knowledge of the general trend and smoothness of the target function, respectively. The use
198 of covariance function is analogous to the kernel trick of SVM (Rasmussen and Williams,
199 2006) and implicitly maps the inputs to features 𝜙(𝐱) (Table 2). GP is also used by kriging
200 methods in geostatistics, where the mean and covariance are typically specified as functions
201 of spatial coordinates. In the context of machine learning, the independent variables of mean
202 and covariance functions include explanatory variables, thus enabling GPR to approximate
203 complex, nonlinear relationships between the target and inputs (features). Starting from the a
priori (i.e., before seeing any data) mean and covariance, GPR uses Bayes’ theorem to
205 infer the posterior distribution of the target conditioned on the training data. Fig. 2a shows
206 samples drawn from a GP with a mean function that a priori follows a linear function of the
207 input; in practical applications such prior knowledge should be incorporated when available.
208 After training data is introduced, samples can be drawn from the posterior of the GP
209 conditioned on training data (Fig. 2b). As such, GP regression is a probabilistic approach that
210 explicitly derives the uncertainty associated with the predictions. As the test data moves away
211 from the range of training data, the prediction given by GPR will converge to the prior mean
with a wide prediction interval (uncertainty) (Fig. 2b). This is sometimes a preferred behavior, because extrapolating with a parametric function such as a polynomial may lead to problematic results.
214 Unlike the sparsity of SVM, exact GPR prediction at an unseen data point is a linear
215 combination of all training data points, with the weights estimated based on the covariance
function. Therefore, a disadvantage of GPR is that the computational cost of storing and operating on the covariance matrix can be prohibitive for large datasets. To overcome this
218 difficulty and improve GPR scalability for big data, various approximation methods have
219 been developed (Liu et al., 2020).
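As an illustration, the sketch below fits a GPR model with a squared exponential (RBF) covariance in scikit-learn and returns both a predictive mean and a predictive standard deviation; the data, kernel settings, and prediction locations are synthetic and illustrative.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Sparse, noisy observations of a smooth 1-D function.
rng = np.random.default_rng(3)
X_train = rng.uniform(0, 10, size=(15, 1))
y_train = np.sin(X_train).ravel() + 0.1 * rng.normal(size=15)

# Squared exponential covariance plus a white-noise term for observation error.
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.01)
gpr = GaussianProcessRegressor(kernel=kernel).fit(X_train, y_train)

# Far from the training data, predictions revert to the prior mean with wide uncertainty.
X_test = np.linspace(0, 20, 5).reshape(-1, 1)
mean, std = gpr.predict(X_test, return_std=True)
for x, m, s in zip(X_test.ravel(), mean, std):
    print(f"x = {x:5.1f}: mean = {m:6.3f}, std = {s:5.3f}")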
220
221
222 Figure 2. Schematic of Gaussian process regression showing a Gaussian process (a)
223 prior based on a linear mean function and a squared exponential covariance function,
and (b) posterior conditioned on training data. Dark lines show the prior and posterior
225 means, respectively, and grey lines are random samples drawn from the GP. Red open
226 circles are training data points, and they “sculpt” the prior into the posterior.
227
228 2.1.5. Decision trees and forests
Decision trees are conceptually simple nonparametric machine learning algorithms.
230 Here we briefly describe the classification and regression trees (CARTs). A CART
231 recursively partitions the feature space into rectangular regions using a sequence of binary
splits. At each step, the CART chooses a splitting variable from among the input variables and a threshold that maximize the goodness-of-fit after the split. The process is repeated until a user-
234 specified minimum number of data points is reached at the leaves, or terminal nodes. Each
leaf represents a rectangular region in the input space, denoted as R_i, i = 1, ..., N, with N denoting the total number of leaves, and CART fits a constant value α_i to R_i. For an unseen data point x*, the CART prediction is a linear combination of the leaf values, i.e., Σ_{i=1}^{N} α_i 1{x* ∈ R_i}, where 1{x* ∈ R_i} is an indicator function equal to 1 if x* falls within the i-th leaf and zero otherwise (Table 2). In its essence, a CART estimates a piecewise constant
240 function. It is a common practice to prune the tree to a subtree to prevent overfitting. A major
241 advantage of decision trees is their interpretability. One disadvantage of decision trees is their
242 statistical instability even after pruning. In other words, small perturbation or noise in the
243 training data may result in substantially different structure of the learned tree (Hastie et al.,
244 2009).
To overcome this instability, forests that are based on multiple
246 trees have been proposed. For example, the Random Forests (RF) is an ensemble learning
247 method proposed by Breiman (2001) based on bootstrap aggregation (i.e., bagging). A RF
248 consists of multiple CARTs, with each CART grown on a bootstrap sample (i.e., sample with
249 replacement) of the training data. Each bootstrap sample leaves out about one-third of the
250 data, which are called the out-of-bag observations. The out-of-bag (oob) error is an estimate
251 of generalization error and can be used to calculate the importance scores of input variables.
Another design feature of RF that enhances performance is that, to reduce the correlation between trees, the splitting variable at each split is selected from a randomly chosen subset of the input
254 variables. After all the CARTs have been grown, the prediction for an unseen data point is
255 calculated as the average of predictions from each individual CART. While being less
256 interpretable than decision trees, RF calculates input variable importance scores that provide
257 valuable information about the dominant factors affecting the target variable. Other popular
258 tree ensemble algorithms include XGBoost (Chen and Guestrin, 2016) and gradient boosting
259 machine (Friedman, 2001; Ke et al., 2017), which build the forest based on boosting
260 algorithms.
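The sketch below (scikit-learn, synthetic data, illustrative settings) fits a random forest regressor and reports the out-of-bag estimate of generalization skill together with the input variable importance scores.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic example: 5 candidate predictors, only the first two matter.
rng = np.random.default_rng(4)
X = rng.normal(size=(500, 5))
y = X[:, 0] ** 2 + 2.0 * X[:, 1] + 0.1 * rng.normal(size=500)

# oob_score=True evaluates each tree on the observations left out of its bootstrap sample.
rf = RandomForestRegressor(n_estimators=300, oob_score=True, random_state=0).fit(X, y)

print("out-of-bag R^2:", rf.oob_score_)
print("variable importance scores:", rf.feature_importances_)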
261
262 2.1.6. Artificial neural network
263 Artificial neural networks (ANNs) have been widely applied to various fields
264 including hydrology. Inspired by biological learning processes, ANNs are built out of a
265 densely interconnected set of units. Here we briefly describe the feedforward neural
266 networks, or multilayer perceptron networks (MLP). A typical MLP network consists of an
267 input layer, one or more hidden layers and an output layer. Fig. 3a shows an example of an
268 MLP with one hidden layer. For MLPs, information flows through the connections between
269 units. Each unit, or neuron, computes a single output by passing the weighted sum of its
270 inputs plus a bias term through a typically smooth, nonlinear activation function (e.g.,
271 sigmoid or rectifier). Using multiple hidden layers, an ANN learns a representation of the raw
272 input, 𝐱, as a recursive function 𝑓𝑑 (… 𝑓𝑗 … (𝑓2 (𝑓1 (𝐱)))), where 𝑓𝑗 is the activation function
273 that takes a vector input (which could be the raw input or output of neurons from the prior
274 layer) and outputs a vector. The output layer computes the final output as the linear
275 combination of the learned representation (the output of the last hidden layer), 𝑓𝑑 (Table 2).
276
277 The weights and biases are learned using the backpropagation algorithm.
278 Backpropagation first evaluates the output values of each neuron in a forward pass of
279 information. Second, it calculates the partial derivative of the loss function with respect to
each learnable weight and bias. It then updates the weights and biases according to the
281 partial derivatives in a backward pass through the layers. A hyperparameter, the learning
282 rate, affects the size of the update. The process is repeated, resulting in a gradient descent
283 approach.
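To make the procedure concrete, the following minimal NumPy sketch trains a one-hidden-layer network on a synthetic regression task with hand-coded backpropagation and full-batch gradient descent; the architecture, data, learning rate, and number of iterations are illustrative choices only.

import numpy as np

rng = np.random.default_rng(0)

# Toy data: learn y = sin(x) from 200 noisy samples (hypothetical example).
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X) + 0.1 * rng.normal(size=(200, 1))

# One hidden layer with 20 tanh units and a linear output.
n_hidden, lr = 20, 0.01
W1 = rng.normal(scale=0.5, size=(1, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.5, size=(n_hidden, 1)); b2 = np.zeros(1)

for epoch in range(2000):
    # Forward pass: hidden activations and network output.
    h = np.tanh(X @ W1 + b1)
    y_hat = h @ W2 + b2
    # Gradient of the mean squared error loss with respect to the output.
    grad_out = 2 * (y_hat - y) / len(X)
    # Backward pass: chain rule through the layers.
    grad_W2 = h.T @ grad_out
    grad_b2 = grad_out.sum(axis=0)
    grad_h = grad_out @ W2.T * (1 - h ** 2)   # derivative of tanh
    grad_W1 = X.T @ grad_h
    grad_b1 = grad_h.sum(axis=0)
    # Gradient descent update of all weights and biases.
    W2 -= lr * grad_W2; b2 -= lr * grad_b2
    W1 -= lr * grad_W1; b1 -= lr * grad_b1

print("final training MSE:", float(np.mean((y_hat - y) ** 2)))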
284
ANNs are considered to have high representational power. It has been proven that an MLP with three layers can approximate any continuous function to arbitrary accuracy given sufficient hidden units (Cybenko, 1989; Mitchell, 1997). A major shortcoming of MLPs is that the
288 backpropagation algorithm is only guaranteed to converge to some local minimum. Research
289 interests in ANNs have been revived in the last decade in the context of deep learning, which
290 is discussed in Section 2.3.
291
292 Figure 3. The architecture of (a) a fully connected ANN and (b) a CNN for classifying
handwritten digits. The ANN has one hidden layer, within which each neuron applies an activation function to a linear combination of the inputs x = [x_1, ..., x_p]^T, the flattened
295 pixel values of the input image. The CNN applies convolution with multiple kernels,
296 pooling, an activation function, followed by a fully connected layer for final output
297 (Section 2.3.2).
298
299 2.2. Model Selection
301 All the supervised machine learning algorithms described in Section 2.1 can be
302 viewed as learning the target function that is a linear combination of features or
303 representations. As summarized in Table 2, the algorithms differ at how
304 features/representations are constructed. In the simplest case of linear regression, the raw
305 input variables are directly used as features. Lasso takes one step further, by learning whether
306 the weights are exactly zero or not. SVM and GPR use a user specified kernel (covariance) to
307 implicitly embed the input into a higher dimensional feature space. CART learns a
308 representation that adaptively partitions the input space into rectangular regions. The
309 representation learned by ANN is the output from the last hidden layer, which can be written
310 as a recursive function. Unlike the other algorithms reviewed in Section 2.1., ANN is not
311 restricted to a particular type of representations and can automatically extract information
312 from raw inputs. This gives ANNs and deep networks high representation power, which is
313 further discussed in Section 2.3.1.
314
The choice of machine learning algorithm is often application specific. While
316 the primary decision factor is the prediction accuracy of the algorithms (generalization
317 performance, Section 2.2.2.), empirical studies on various benchmark datasets have suggested
318 that tree ensemble algorithms generally work well (Fernández-Delgado et al., 2014; 2019).
This is because tree-based algorithms have built-in capabilities for variable selection and for accounting for interactions among input variables. However, many hydrologic applications
321 involve target functions that exhibit local smoothness. In this case, it may be more
322 advantageous to use methods such as SVM and GPR, which can enforce local smoothness by
323 choosing an appropriate kernel (e.g., the squared exponential kernel). As will be discussed in
324 Section 2.3.1, deep networks typically outperform conventional machine learning algorithms
325 when dealing with unstructured data such as texts, images, and videos because of their
326 capability of automatic representation learning.
327
328 While generalization performance is arguably the most important consideration for model
329 selection, it is sometimes desirable to select algorithms with high interpretability. For
330 example, Lasso produces a parsimonious linear model and is therefore easy to interpret.
In addition, decision trees learn a hierarchical model structure that can be easily visualized;
332 however, tree ensemble methods are less interpretable.
333
334 2.2.2. Generalization Performance
335 Generalization error, used interchangeably with test error, is defined as the expected
336 prediction error, as measured by a given metric, over unseen data points, yielded by a
337 machine learning model trained on a given training dataset. In contrast, the training error
338 refers to the average error over the training data points. Commonly used error metrics include
339 0-1 loss (0 if a data point is correctly categorized and 1 otherwise) for classification and mean
340 squared error and log likelihood for regression tasks. Because prediction is a central goal of
341 both data-driven and process-based modeling efforts. estimating generalization error is
342 critical for gaining confidence in a particular model for prediction tasks and selecting the best
343 model and/or hyperparameters from a set of candidates.
344
345 Unsurprisingly, the capability of a model to fit a given set of training data increases as
346 its complexity increases. An underfitting model will generalize poorly because it is not
347 complex enough to capture the range of variability of the target function. For example, an
348 ANN with 1 hidden layer and 1 hidden unit will likely fit the data poorly; as more layers and
349 hidden nodes are added to the ANN, both the training and test errors decrease because of the
350 added representation power. However, when the model complexity exceeds the degree that
351 can be justified by the training data, the model becomes overfitted: although training error
352 continuously decreases, test error starts to increase (Fig. 4). An overly complex model
353 overfits the training data in that it may extract some of the noise. Consider as an example
training an ANN with M hidden units to fit n data points that follow a Gaussian distribution
355 with zero mean and unit standard deviation. When 𝑀 ≥ 𝑛, the ANN can fit the data perfectly.
356 However, it tends to fail at generalizing to data it has not seen before. Besides number of
357 parameters (weights for ANNs), model complexity is also manifested by the size of the
358 parameters. When training an ANN, it is often observed that as training epochs elapse,
359 training error decreases as the weights are adjusted and the model gets better at fitting the
360 training data. However, at some point the generalization error starts to increase (Prechelt,
361 1998).
362
363 The general trend of training and test errors can be explained by statistical learning
364 theory. Assuming that data points in the training and test sets are independent, identically
365 distributed, it can be shown that the training error is usually lower than the test error. The
366 expected squared error of a trained model on an unseen data point can be decomposed into
367 three terms. The first term represents irreducible error and is the variance of the measurement
368 error associated with the target. The second term is the square of the bias caused by the
369 hypothesis space of the learning method, such as approximating a nonlinear function with a
370 linear model. The third term is the variance of the fitted model. There is usually a tradeoff
371 between bias and variance. A more complex model yields lower bias at the expense of higher
372 variance and thus may be prone to overfitting (Hastie et al., 2009).
373
374 In order to find the model that will yield low generalization error, the common
375 practice is to randomly divide the dataset into training, validation, and test subsets. Shuffling
376 is recommended so that the three subsets are approximately from the same distribution. A
377 model is repeatedly fitted to the training set, each time using a different set of
378 hyperparameters or machine learning algorithms. The generalization error of the fitted
379 models will then be evaluated on the validation set. Finally, the best-performing combination
380 of machine learning algorithm and hyperparameters is selected and evaluated with the test
381 set.
382
383
384 Figure 4. Schematic of trends in training and generalization errors as the model
becomes more complex. As model complexity increases, training error overall tends to decrease, while test error first decreases and then increases once the model begins to overfit, despite temporary fluctuations.
387
388 Some machine learning algorithms have their own implementations for estimating
389 generalization error. For example, random forest uses the out-of-bag error as an estimate.
390 Cross-validation (CV) is a model-generic approach that can be applied to essentially any
391 machine learning algorithm. CV is arguably the simplest technique for estimating the
392 generalization error of a learned model and is routinely used for hyperparameter selection.
393 CV partitions the examples (with known inputs and target) into a training and a validation
394 set. A machine learning model is built based on the training set, and its performance is
395 assessed on the validation set. Multiple rounds are performed, each time using a different data
396 partition. The resulting error metrics (e.g., misclassification rate, mean squared error) on the
397 validation set are combined to estimate the generalization error of the model. Various
398 implementations of CV exist, differing in how data is partitioned. Two commonly used
399 implementations are leave-one-out (validation set consists of a single datapoint) and k-fold
400 CV (validation set is one of 𝑘 subsets).
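As a brief illustration, the sketch below estimates the generalization error of a random forest with 5-fold cross validation using scikit-learn; the data and model settings are synthetic and illustrative.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 4))
y = X[:, 0] - 0.5 * X[:, 2] + 0.1 * rng.normal(size=300)

# 5-fold CV: each fold serves once as the validation set.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestRegressor(n_estimators=100, random_state=0),
                         X, y, cv=cv, scoring="neg_mean_squared_error")

print("estimated generalization MSE:", -scores.mean())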
401
402 A common practice to prevent overfitting and improve generalization performance is
403 using regularization strategies. During training, the machine learning algorithm seeks to
404 minimize the loss function that evaluates the misfit between the model and the given targets.
405 It may be desirable to impose preference to other behaviors of the learned model such as
406 smoothness and sparsity. In order to achieve this goal, regularization techniques add a penalty
407 to the loss function; the 𝐿1 and 𝐿2 norms of learned coefficients are often used as penalty,
408 such as in Lasso and SVM, respectively. In addition to explicitly representing preference via
409 a penalty term, regularization may be implemented implicitly. For example, the pruning
410 technique reduces the complexity of a CART and alleviates overfitting. Training of ANNs
411 often employs the early stopping strategy, which monitors the test error on a validation set
412 and terminates the training when the test error continuously increases (Fig. 4). Regularization
413 techniques specifically designed for deep learning will be described in Section 2.3.
414
415 2.2.3. Curse of dimensionality and variable selection
416 In addition to the choice of machine learning algorithms and hyperparameters, the
417 generalization error is affected by the selection of input variables. In hydrologic applications,
418 a variety of observed and derived data may provide some information towards the problem of
interest. However, including all relevant features poses challenges to machine learning
420 algorithms, known as the curse of dimensionality (Hastie et al., 2009). Dimension reduction
421 techniques can be used to reduce input dimensionality and improve efficiency. For example,
422 the principal component analysis (PCA) is a commonly used dimension reduction method,
423 which extracts linear combinations of input variables that explain most of the variability in
424 data and then uses the combinations as inputs to machine learning algorithms. A related
425 method, linear discriminant analysis (LDA), is a supervised dimension reduction method that
426 takes the target variable (i.e., class labels) into consideration when extracting linear
427 combinations of input variables (Izenman, 2013).
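A minimal sketch of PCA-based dimension reduction with scikit-learn is shown below; the synthetic, partly redundant input matrix and the retained variance threshold are illustrative assumptions.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical input matrix with many correlated predictors.
rng = np.random.default_rng(6)
base = rng.normal(size=(200, 3))
X = np.hstack([base, base + 0.05 * rng.normal(size=(200, 3)),
               rng.normal(size=(200, 4))])   # 10 columns, partly redundant

# Standardize, then keep enough components to explain 95% of the variance.
pca = PCA(n_components=0.95)
Z = pca.fit_transform(StandardScaler().fit_transform(X))

print("reduced dimension:", Z.shape[1])
print("explained variance ratios:", pca.explained_variance_ratio_)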
428
429 Dimension reduction can also be formulated as a variable selection problem, which
430 has been studied extensively in the literature (George, 2000; Guyon and Elisseeff, 2003;
Liang et al., 2008). Classical variable selection methods include backward elimination, where variables are sequentially removed from the full model; forward selection, where variables are sequentially added to the model; and combinations of both (Blanchet et al., 2008). A variety of selection criteria can be used to determine which variable to remove or add, such as F-tests, t-tests, the Akaike information criterion (AIC), and the Bayesian information criterion (BIC) (Burnham
436 and Anderson, 2004). In addition to these generic methods, some supervised machine
learning algorithms have built-in variable selection functionality. Examples include Lasso
438 (Section 2.1.2), CART and random forests (Section 2.1.5). For some applications, it may be
439 beneficial to perform PCA/LDA to obtain a reduced set of input variables before using Lasso
440 or trees and forests. Although the above-mentioned automatic variable selection techniques
441 are powerful tools to reduce the input dimension, they should not replace careful feature
442 selection based on expert knowledge whenever such knowledge is available.
443 2.3. Deep learning
445 Conventional machine learning techniques often do not perform well for complex
446 tasks such as computer vision, speech recognition, and natural language processing. These
447 tasks involve large volumes of natural data in the raw form, such as images, videos and text.
448 Consider as an example an intensively studied benchmark, the MNIST (Modified National
449 Institute of Standards and Technology) database. The database consists of normalized
450 grayscale scanned images of digits (0 to 9) handwritten by human individuals. When
451 applying a conventional machine learning algorithm, the pixels within an image are typically
452 unfolded (or flattened) into a vector, and each pixel is treated independently. An ANN can be
453 constructed with 𝑝 input units, 𝑝 being the total number of pixels within an image, and
454 multiple hidden layers. These layers are fully connected in that the learning process will
455 attempt to learn the weights connecting each pair of units in adjacent layers (Fig. 3a), leading
456 to a large number of learnable parameters. This greatly increases the need for training data
457 points to make the learning problem well posed and the difficulty for an optimization
458 algorithm to find a solution. In addition, the pixel representation of an image does not
account for spatial correlation among pixels and is not invariant to transformations such as rotation and shift.
461
462 For many applications including the MNIST benchmark, careful handcrafting of
463 features from raw data has been critical to achieve good performance with conventional
464 machine learning algorithms. This feature engineering process relies on substantial manual
465 efforts and domain expertise, and is application specific. When dealing with a large volume
466 of data that have complex and nonlinear patterns, conventional machine learning with the
467 handcrafted features is not flexible enough to extract these patterns (Najafabadi et al., 2015).
468 Representation learning replaces manual feature engineering and automatically extracts,
469 using a general-purpose learning procedure, representations of the raw data that might be
470 useful for subsequent supervised learning tasks. Deep learning architectures stack multilayer
471 neural networks to learn such representations. Each layer can be thought of as learning one
aspect of the underlying structure of the data, and stacking layers composes the structures
473 learned by individual layers. Research on deep learning theory suggests that such distributed
474 representation endows deep learning with exponential advantages over conventional learning
475 algorithms based on local representation (Bengio et al., 2013). It has been shown that deep
476 networks can be efficiently trained by gradient descent methods (Rumelhart et al., 1986;
477 Glorot et al., 2011), and greater depth generally leads to better generalization performance
478 (Bengio et al., 2007; Ciregan et al., 2012; Goodfellow et al., 2016).
479
480 Deep learning techniques take advantage of fast GPUs and increasing data availability
481 and have achieved record performance in various computer vision, speech recognition and
482 natural language processing tasks. They have also been shown to hold great promise in many
483 domains of science and engineering. In this subsection, we briefly describe some of the deep
484 learning architectures that are the most relevant to hydrologic applications.
485
486 2.3.2. Convolutional Networks
487 In order to overcome the limitations of traditional ANNs on the MNIST database,
LeCun et al. (1990; 1998) hand-crafted a neural network architecture with locally connected
489 layers and shared weights. These neural networks significantly outperformed the fully
490 connected ANNs on experiments centered around the MNIST database. These pioneering
491 efforts led to the development of convolutional networks (CNNs). In 2012, a deep and wide
CNN model, AlexNet (Fig. 5; Krizhevsky et al., 2012), was proposed; it won the ImageNet Large Scale Visual Recognition Challenge, outperforming all competing conventional machine learning and computer vision approaches. As of today, CNNs have achieved remarkable
495 successes in computer vision and related areas. Designed for multi-dimensional arrays, CNNs
496 use convolution operations in place of fully connected matrix multiplication. A convolutional
497 layer applies a kernel (or filter) that calculates a local weighted sum as the kernel slides
498 through the input array. The number of learnable weights depends only on the kernel size and
499 is usually much smaller than the size of the input array. Multiple kernels can be applied
500 simultaneously to output a multi-channel image (Fig. 3b). Such sparse connectivity is the key
501 advantage of CNN over classical ANNs with full connectivity (Goodfellow et al., 2016). The
502 local weighted sums are then passed through a nonlinear activation layer, such as ReLU that
503 applies the rectifier activation 𝑚𝑎𝑥(0, 𝑥), where 𝑥 is the local weighted sum. In this way, the
504 convolutional layer extracts local motifs of the input array or output from the previous layer.
505 Subsequently, a pooling layer merges local features by calculating local statistics (such as
506 max) to reduce the dimension of representation (Fig. 3b) and preserve shift invariance
507 properties. Multiple convolutional, nonlinear, and pooling layers can be stacked (Fig. 5) to
508 extract hierarchical patterns where higher-level features are derived by composing lower-
509 level features (LeCun et al., 2015). Finally, the high-level features are usually flattened
510 before passing through a fully connected layer for classification or regression (Fig. 5).
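A minimal PyTorch sketch of a small CNN in the spirit of Fig. 3b is given below (the layer sizes are illustrative and assume 28 x 28 single-channel inputs such as MNIST digits); it is not the AlexNet architecture of Fig. 5.

import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, n_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # 16 kernels, 3x3
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 28x28 -> 14x14
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 14x14 -> 7x7
        )
        self.classifier = nn.Linear(32 * 7 * 7, n_classes)  # fully connected output

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, start_dim=1)   # flatten feature maps before the FC layer
        return self.classifier(x)

model = SmallCNN()
dummy = torch.randn(4, 1, 28, 28)   # batch of 4 grayscale 28x28 images
print(model(dummy).shape)           # torch.Size([4, 10]), one score per class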
511
512 Figure 5. The architecture of the AlexNet (Krizhevsky et al., 2012) consists of
513 convolution, max-pooling, local response normalization (LRN), ReLU and fully
514 connected (FC) layers.
515
516 2.3.3. Recurrent Neural Networks for Sequence Modeling
517 Recurrent Neural Networks (RNNs) are designed for modeling sequential data such as
518 sentences and time series with some underlying temporal dynamics. An RNN digests one
519 element (e.g., a word, the quantity of interest at one time step) of the input sequence at a time
520 and uses its hidden units to keep information learned from the past elements of the sequence.
521 Therefore, we can “unroll” the RNN and consider it as a chain of recurrent neurons, each
corresponding to one time step (Fig. 6). Analogous to CNNs, which share kernel weights across different locations of the input multidimensional array, RNNs share weights across different locations (time steps) in the input sequence. While the RNN architecture
525 can represent complex dynamics, its training suffers from the well-known vanishing gradient
526 problem. The backpropagated gradients either grow or shrink at each time step; after many
527 time steps, the gradients will either explode (leading to unstable optimization) or, more likely,
528 vanish. Almost zero gradients greatly slow down the learning process because each iteration
529 would apply a very small update to the weights (Bengio et al., 1994; Hochreiter, 1998).
530
Long short-term memory (LSTM) is a recurrent neural network (RNN) architecture
532 proposed to overcome the vanishing gradient problem. LSTM and its variants have proven
533 powerful for learning long-term dependencies in time series (Graves et al., 2012; Greff et al.,
2017). Each LSTM cell corresponds to one time step; the cell is repeated (unrolled) along the sequence and retains past information in its cell memory. Fig. 6 shows the classical LSTM architecture
536 (Hochreiter and Schmidhuber, 1997). At each time step 𝑡, the current input 𝑥𝑡 is combined
537 with hidden state (ℎ𝑡−1 ) and cell memory (𝑐𝑡−1 ) from the previous time step to determine
538 whether the input will be accumulated to cell memory 𝑐𝑡 according to the input gate 𝑖𝑡 and
539 whether the past cell memory 𝑐𝑡−1 will be forgotten according to the forget gate 𝑓𝑡 . The
540 output gate 𝑜𝑡 then determines whether the hidden state ℎ𝑡 will be updated with the cell
541 memory 𝑐𝑡 .
542
543 Figure 6. A recurrent neural network (RNN) with LSTM cells. At time step 𝑡, 𝑥𝑡 is the
544 current input, 𝑐𝑡 is the cell memory, ℎ𝑡 is hidden state, 𝑖𝑡 , 𝑓𝑡 , 𝑜𝑡 are the input, forget, and
545 output gates, respectively, 𝑔𝑡 is the cell input activation vector, and ⊙ denotes element-
546 wise array multiplication.
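As an illustration, the PyTorch sketch below applies an LSTM to a many-to-one regression, for example mapping a sequence of meteorological forcings to streamflow at the final time step; the input variables, layer sizes, and sequence length are hypothetical choices for demonstration only.

import torch
import torch.nn as nn

class LSTMRegressor(nn.Module):
    def __init__(self, n_inputs: int = 3, n_hidden: int = 64):
        super().__init__()
        # batch_first=True expects input of shape (batch, time, variables).
        self.lstm = nn.LSTM(input_size=n_inputs, hidden_size=n_hidden, batch_first=True)
        self.head = nn.Linear(n_hidden, 1)

    def forward(self, x):
        out, (h_n, c_n) = self.lstm(x)   # out: hidden states at every time step
        return self.head(out[:, -1, :])  # regress from the last hidden state

model = LSTMRegressor()
# Hypothetical batch: 8 sequences of 365 daily time steps with 3 forcing variables
# (e.g., precipitation, temperature, potential evapotranspiration).
forcings = torch.randn(8, 365, 3)
print(model(forcings).shape)             # torch.Size([8, 1])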
547
548 2.3.4. Other popular architectures
564 (Vincent et al., 2010). Recently, several Bayesian autoencoders have been proposed, known
565 as variational autoencoders, since variational algorithms are used to learn the probabilistic
566 description of the latent representation (Kingma and Welling, 2013; Sønderby et al., 2016). In
567 the Bayesian version of autoencoders, the encoder produces the (approximated) posterior
568 distribution of the latent representation, and the decoder samples one or more realizations
569 from the estimated posterior to generate reconstructions of the original input. The variational
570 autoencoders are believed to be a promising approach for inferring causal relationships.
571
572 Generative adversarial network (GAN) is another architecture for generative learning.
573 GAN learns to generate new data with the same statistics as a given training set (usually
574 images). A generative network and a discriminator compete with each other in the form of a
575 zero-sum game (Goodfellow et al., 2014; Creswell et al., 2018). The generative network,
576 typically based on deconvolutional layers, synthesizes candidates that are similar to the
577 training data with the objective to “fool” the discriminator network, while the discriminator
578 attempts to distinguish synthesized candidates from the true data. Through this process, the
579 GAN gets better at generating synthetic data that resemble the training data. Because the
580 generative network is implicitly trained through the discriminator, and the discriminator is
581 being updated, GAN is particularly suitable for unsupervised learning although it can also be
582 used for supervised and semi-supervised learning where training data are scarce. GANs have
583 attracted wide attention due to potential use for malicious applications such as producing fake
584 photographs and videos. As discussed in Section 3.2.1, GANs have important applications in
585 inverse modeling of geologic media.
586
587 Finally, in recent years attention has become a very influential idea in the deep
588 learning community. Attention enables a deep network to focus on certain parts of the input
589 data in a way similar to how human beings would pay attention to different regions of an
590 image or correlate words at different locations in sentences. This is achieved through learning
591 importance weights that describe how strongly the target is correlated to the elements of input
592 data. There are various attention mechanisms designed to accompany CNNs and RNNs
593 among other deep learning architectures, and they have achieved high performance for many
594 tasks such as image captioning (Vinyals et al., 2015) and translation (Vaswani et al., 2017;
595 Chaudhari et al., 2020).
596
597 2.3.5. Common practices and other considerations
598 Learning the weights for a deep network is usually a hard problem, and standard
599 gradient descent and random initialization often perform poorly (Glorot and Bengio, 2010).
600 As a result, various initialization strategies and variants of gradient descent have been
601 proposed (e.g., Bottou, 2010; Saxe et al., 2011; Sutskever et al., 2013; Kingma and Ba,
602 2015). Because deep learning often deals with very large amounts of data posing
computational challenges, a common practice is to divide the dataset into small subsets, called mini-batches. At each iteration, one mini-batch is loaded and backpropagation is
605 executed, leading to mini-batch gradient descent (Li et al., 2014). This is repeated until all
606 mini-batches have been used, concluding one epoch. The training process lasts for multiple
607 epochs; the number of epochs is a user-specified parameter but may be determined using the
608 early stopping strategy. Another hyperparameter that plays an important role in the training
and generalization performance of deep networks is the learning rate. In its simplest form, it can be specified as a constant hyperparameter. A number of methods have been developed recently that adapt the learning rate as training progresses, such as Adam (Kingma and Ba,
612 2015).
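The self-contained PyTorch sketch below illustrates these practices: the data are split into shuffled mini-batches, one pass over all mini-batches constitutes an epoch, and Adam is used as the optimizer. The synthetic data, small fully connected model, batch size, learning rate, and number of epochs are illustrative assumptions.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic regression data: 1,000 samples with 5 input variables.
torch.manual_seed(0)
X = torch.randn(1000, 5)
y = X[:, :1] ** 2 + X[:, 1:2] + 0.1 * torch.randn(1000, 1)

# DataLoader splits the dataset into shuffled mini-batches of 64 samples.
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

model = nn.Sequential(nn.Linear(5, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # adaptive learning rates
loss_fn = nn.MSELoss()

for epoch in range(20):              # one epoch = one pass over all mini-batches
    for xb, yb in loader:
        optimizer.zero_grad()        # clear gradients from the previous step
        loss = loss_fn(model(xb), yb)
        loss.backward()              # backpropagation on the current mini-batch
        optimizer.step()             # gradient-based update of weights and biases
    if (epoch + 1) % 5 == 0:
        print(f"epoch {epoch + 1}: last mini-batch loss = {loss.item():.4f}")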
613
614 The regularization strategies for conventional machine learning algorithms discussed
615 in Section 2.2.2 mostly apply to deep learning as well. In addition to those strategies, dropout
616 (Srivastava et al., 2014) is a computationally efficient but powerful method specifically
617 designed for deep learning. Dropout can be thought of as a practical approximation to the
618 idea of bagging in ensemble learning (such as the random forest). Traditional bagging
619 requires training and retaining multiple models and would become computationally
unaffordable for very large neural networks. Dropout randomly omits a portion (as determined by the dropout rate) of the units during training, thus regularizing the complexity (and variance) of the learned network. More precisely, each time a batch is loaded, a randomly selected subset of the neurons is temporarily dropped, so only the weights associated with the remaining neurons are updated by backpropagation. The added cost
624 of applying dropout at each step to a specific network is negligible. It was shown that dropout
is more effective than other regularization methods, including those based on the L1 and L2 norms (Srivastava et al., 2014).
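In frameworks such as PyTorch, dropout is applied by inserting a dropout layer between fully connected layers, as in the minimal sketch below; the layer sizes and dropout rate are illustrative. Calling model.train() and model.eval() switches dropout on for training and off for deterministic prediction, respectively.

import torch
import torch.nn as nn

# A small fully connected network with dropout between the hidden layers;
# p=0.5 means each hidden unit is dropped with probability 0.5 during training.
model = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(64, 1),
)

x = torch.randn(4, 10)
model.train()                  # dropout active: repeated forward passes differ
print(model(x).flatten())
model.eval()                   # dropout disabled for deterministic predictions
print(model(x).flatten())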
627
628 As discussed above, many aspects of a deep learning algorithm are controlled by a
629 handful of hyperparameters such as learning rate and dropout rate. It is often the case that at
630 least some of the hyperparameters need to be tuned to improve the model’s generalization
631 performance. Methods such as grid-search work well for conventional machine learning
632 methods but may become computationally expensive for deep learning. For an overview of
633 automatic hyperparameter optimization algorithms and general recommendations for manual
634 tuning, readers are referred to Goodfellow et al. (2016) and Hutter et al. (2019).
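For conventional algorithms, a grid search over a small hyperparameter grid is straightforward with scikit-learn, as in the sketch below; the SVR estimator, the grid values, and the synthetic data are illustrative assumptions rather than recommended settings.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=200)

# Exhaustively evaluate each hyperparameter combination with 5-fold cross validation.
grid = GridSearchCV(SVR(kernel="rbf"),
                    param_grid={"C": [1, 10, 100], "epsilon": [0.01, 0.1]},
                    cv=5, scoring="neg_mean_squared_error")
grid.fit(X, y)

print("best hyperparameters:", grid.best_params_)
print("best cross-validated MSE:", -grid.best_score_)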
635
636 3. APPLICATIONS IN HYDROLOGIC SCIENCES
638 3.1.1. Detecting patterns and events from remote sensing data
The recent growth in hydrologic observations has been boosted largely by the increasing
640 availability of remote sensing data. Remote sensing provides measurements directly or
641 indirectly related to the water cycle with unprecedented spatial coverage. While some
642 products have been available for decades, recently remote sensing is increasingly used as
643 more products become available and cyberinfrastructure advances lower the barriers to
644 accessing and using these data. Particularly in areas where in situ monitoring networks are
645 sparse or missing, remotely sensed data are an important source of information for large scale
646 monitoring of patterns and events related to hydrologic sciences as well as estimating key
647 hydrologic variables. This section briefly reviews applications in which machine learning is
648 used for classification, and regression applications will be discussed in Section 3.1.2.
649 Machine learning is being used to identify water-related land cover changes and land
surface features from remotely sensed data (Fig. 7), often leveraging cloud computing
651 platforms (e.g., Google Earth Engine, Gorelick et al., 2017) to process large quantities of
652 geospatial data (e.g., Deines et al., 2017; Gao et al., 2018; Cho et al., 2019; Yuan et al., 2020
653 and references therein). For example, Deines et al. (2017) used a random forest classifier to
654 identify irrigated areas in the High Plains, an arid to semi-arid region, based on high
655 resolution multi-spectral satellite imagery. In another study in a sub-humid area, a set of
novel input features was hand-crafted to enhance the contrast between neighboring rainfed
657 and irrigated areas, and these features enabled a random forest classifier to achieve
658 satisfactory performance in mapping irrigated areas (Xu et al., 2019). In climate sciences,
659 deep learning was recently applied to detection of extreme weather events such as tropical
660 cyclones, atmospheric rivers and weather fronts. Existing approaches to detecting such
661 extremes rely on human expertise and subjective detection thresholds. Using deep learning
662 architectures such as convolutional layers can overcome these limitations and has proven to
663 be promising (Liu et al., 2016; Racah et al., 2017; Kim et al., 2019).
664
665 Figure 7. Machine learning has been used in various hydrologic applications in stand-
666 alone mode or integrated with process-based modeling. Machine learning can process
667 multi-type data to identify hydrologic events and estimate variables (1), approximate
668 hydrologic processes and generate new knowledge regarding the processes (2), aid in
parameterization of process-based models (3), develop fast surrogates (4), and correct the
670 bias of process-based models (5). The current research frontier is to explore hybrid
671 modeling that integrates physical knowledge with machine learning to achieve
672 improved prediction accuracy and interpretability (5, 6) (Karpatne et al., 2019;
673 Reichstein et al., 2019). Arrows indicate information flow.
674
675 3.1.2. Estimating hydrologic variables
686 Estimation of precipitation is critical for climatic and hydrologic research.
687 PERSIANN and its variants are arguably the most successful machine learning-derived,
688 remote sensing-based precipitation estimates (Sorooshian et al., 2000; Ashouri et al., 2015;
689 Tao et al., 2016). Earlier versions of PERSIANN used the classical ANN to estimate
690 precipitation from satellite longwave infrared imagery. Recently, Tao et al. (2016) used a
stacked denoising autoencoder to improve estimation accuracy; the deep network was shown to substantially alleviate bias and false alarms. A follow-up study combined
693 PERSIANN precipitation with LSTM to provide short-term precipitation forecast (Akbari
694 Asanjan et al., 2018). Motivated by the spatiotemporal correlation structure underlying the
695 precipitation field, the convolutional layer and LSTM architectures have been combined and
696 applied to precipitation nowcasting from radar data (Shi et al., 2015; Shi et al., 2017).
697 Conventional machine learning and deep learning methods have also been used for statistical
698 downscaling and merging spaceborne, ground-based, and rain gauge precipitation
699 measurements (Kleiber et al., 2012; Vandal et al., 2017; Chen et al., 2019; Pan et al., 2019).
700
701 Besides precipitation, machine learning methods have been used to estimate SWE
702 (Bair et al., 2018; Broxton et al., 2019), ET (e.g., Ke et al., 2016; Xu et al., 2018) and soil
703 moisture (e.g., Ahmad et al., 2010; Aboutalebi et al., 2019; Zhang et al., 2017; Lee et al.,
704 2019) from remote sensing and in situ measurements. For example, Bair et al. (2018)
705 estimated SWE in the watersheds of Afghanistan in real time using physiographic and remote
706 sensing data. Ke et al. (2016) used machine learning and 30-m resolution Landsat imagery to
707 downscale MODIS 1-km ET. Aboutalebi et al. (2019) estimated moisture content of different
708 soil layers from high-resolution UAV multi-spectral imagery and compared the performance
709 of genetic programming (a combination of an evolutionary algorithm and artificial
710 intelligence), ANN, and SVM. They found that the performance of machine learning
711 algorithms increases for deeper soils, and genetic programming achieved significantly higher
712 accuracy than SVM and ANN at the deepest validation point. In addition, genetic
713 programming outputs an equation that can be potentially transferred to other regions. At a
714 larger scale, Zhang et al. (2017) used deep learning to estimate soil moisture for all croplands
715 of China from Visible Infrared Imaging Radiometer Suite (VIIRS) raw data. Assessed using
716 in situ measurements, the estimated soil moisture was more accurate than the Soil Moisture
717 Active Passive (SMAP) active radar soil moisture and the Global Land Data Assimilation
718 System (GLDAS) products. Besides remotely sensed data, machine learning algorithms can
719 also be used to leverage in situ moisture measurements. For example, Andugula et al. (2017)
720 used GPR to upscale point-based soil moisture measurements from a dense sensor network.
721 In groundwater hydrology, there are emerging applications of machine learning.
Seyoum et al. (2019) estimated groundwater level anomalies by downscaling GRACE
723 Terrestrial Water Storage Anomaly (TWSA). Smith and Majumdar (2020) used random
724 forests to map land subsidence due to groundwater pumping based on ET, land use, and
725 sediment thickness. Various studies used conventional machine learning algorithms to map
726 groundwater potential (occurrence of springs) based on topographic, land use, and geologic
727 factors (e.g., Naghibi et al., 2017; Chen et al., 2019; Kordestani et al., 2019). The mapping
accuracy was found to be sensitive to the size of the training dataset (Moghaddam D.D. et al., 2020). Moghaddam M.A. et al. (2020) estimated the flux between a river and groundwater
730 from high frequency observations of subsurface pressure and temperature using CART and
731 gradient boosting.
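As a simple illustration of the groundwater potential mapping task, the sketch below frames it as binary classification (spring present or absent) from gridded predictors using a random forest; the predictors and labels are synthetic and only stand in for the topographic, land use, and geologic factors used in the cited studies.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Toy sketch of groundwater potential mapping as binary classification
# (spring present / absent) from gridded predictors; all data here are synthetic.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))            # e.g., slope, elevation, land use index, geology index
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=2000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

# Probability of spring occurrence can be mapped over the grid cells of a study area.
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
```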
732 In addition to the above studies, machine learning has been used in environmental
733 monitoring applications such as predicting recreational water quality advisories (Brooks et
734 al., 2016), estimating groundwater nitrate concentration (Nolan et al., 2015), and identifying
735 facilities likely to violate environmental regulations (Hino et al., 2018).
736
737 3.1.3. Approximating hydrologic processes
738 Some studies have used machine learning to emulate the dynamic processes
governing hydrologic variables (e.g., Torres-Rua et al., 2011; Fang et al., 2017; Zhao et al., 2019; Fang and Shen, 2020). Torres-Rua et al. (2011) used the relevance vector machine algorithm to forecast daily PET under conditions of limited climate data availability. Zhao et al.
742 (2019) developed a physics-constrained ANN model to predict ET by embedding surface
energy conservation into the loss function. Fang et al. (2017) used an LSTM to reproduce the SMAP surface soil moisture product over CONUS. The LSTM was trained using meteorological forcings and land surface model outputs as inputs and the SMAP product as the target. The LSTM model reproduced the soil moisture dynamics with
747 higher accuracy than regularized linear regression, autoregression, and a simple ANN.
748
749 Rainfall-runoff modeling and streamflow forecasting have profound implications for
750 water resources management and have been investigated for decades. Applications of
machine learning to rainfall-runoff modeling date back to the 1990s (Buch et al.,
752 1993; Kang et al., 1993; Hsu et al., 1995; Smith and Eli, 1995). While the earliest
753 applications were focused on ANNs, later studies have employed a variety of conventional
754 machine learning algorithms (Yaseen et al., 2015 and references therein), such as SVM
(Asefa et al., 2006; Rasouli et al., 2012; Adnan et al., 2020), GPR (Rasouli et al., 2012),
756 multivariate adaptive regression splines (Adnan et al., 2020), and neural network-based
methods (Rasouli et al., 2012; Ren et al., 2018; Boucher et al., 2020). There is no consensus on a single machine learning algorithm that outperforms the others; in many applications these methods achieved satisfactory results at various temporal and spatial scales and across different hydrologic regimes.
761
762 Conventional machine learning algorithms, with the exception of autoregressive
763 models, do not have mechanisms to explicitly represent the temporal evolution of the
764 hydrologic processes. Therefore, applying conventional machine learning to rainfall-runoff
765 modeling requires hand-crafting a set of input features that encapsulate some “history” of the
766 watershed, such as lagged meteorological time series and accumulated precipitation over a
767 past period of time. Recently, there has been a growing interest in applying RNNs, LSTM in
768 particular, to rainfall-runoff modeling and streamflow forecasting because these deep
769 learning architectures can represent long-term dependencies (Kratzert et al., 2018; Kratzert et
770 al., 2019b; Jiang et al., 2020; Tenant et al., 2020). For example, Kratzert et al. (2018) used
771 LSTM to simulate daily streamflow using meteorological forcings including daily
772 precipitation, maximum and minimum temperature, shortwave downward radiation, and
773 humidity. It was shown for some watersheds that the LSTM used its cell memory to
774 approximate the watershed storage dynamics such as snow accumulation and melt within the
775 annual cycle. In addition, it was found that LSTM achieved overall good performance as a
776 regional model when it was trained using data from many catchments. When the regional
LSTM model was fine-tuned for each catchment separately, it outperformed a
778 commonly used hydrologic model (SAC-SMA combined with Snow-17) calibrated for
779 individual catchments in the CAMELS dataset. A follow-up study further investigated the
780 capability of LSTM as a regional model and modified the vanilla LSTM architecture to
781 embed catchment characteristics as static inputs in addition to time-varying meteorological
forcings (Kratzert et al., 2019b). The resulting LSTM model outperformed several lumped and
783 distributed hydrological models. Besides rainfall-runoff modeling, LSTM has been used for
short-term flood forecasting with lead times of hours to days (e.g., Hu et al., 2019; Lv et al.,
785 2020; Xiang et al., 2020). For example, Hu et al. (2019) developed a spatio-temporal flood
forecasting framework in which proper orthogonal decomposition and singular value decomposition (SVD) were applied to reduce the dimensionality of the large training dataset and the computational cost associated with
788 training and forward evaluation of the LSTM model. Ding et al. (2019) combined attention
789 mechanisms with LSTM; the resulting model outperformed LSTM without attention, SVM,
790 and ANN. Besides LSTM, other deep learning architectures such as autoencoders have also
791 been used for streamflow forecasting (Liu et al., 2017).
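A minimal sketch of the LSTM rainfall-runoff setup is given below, assuming a fixed-length window of daily forcings mapped to discharge on the last day of the window; the window length, forcing variables, and layer sizes are illustrative and not the settings of Kratzert et al. (2018).

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Minimal sketch of an LSTM rainfall-runoff model in the spirit of the studies above:
# a window of daily forcings (precipitation, Tmax, Tmin, radiation, humidity) is mapped
# to the discharge on the last day of the window. Window length and layer sizes are
# illustrative assumptions, not the settings used by Kratzert et al. (2018).
seq_len, n_forcings = 365, 5
model = models.Sequential([
    layers.Input(shape=(seq_len, n_forcings)),
    layers.LSTM(64),          # cell states can store slow dynamics, e.g., snow storage
    layers.Dense(1),          # simulated daily discharge (e.g., mm/day)
])
model.compile(optimizer="adam", loss="mse")
# model.fit(forcing_windows, discharge, ...) with forcing_windows shaped
# (samples, 365, 5) and discharge shaped (samples, 1).
```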
792
793 In the groundwater hydrology community, there is also a growing body of research
applying machine learning techniques. Some of these studies focus on predicting groundwater levels from meteorological variables using conventional machine learning (Yoon
796 et al., 2011; Sahoo et al., 2017; Wunsch et al., 2018; Guzman et al., 2019) and deep learning
797 (Ghose et al., 2018; Zhang et al., 2018; Ma et al., 2020). Other studies have investigated the
798 potential of machine learning for groundwater flow simulation. Tartakovsky et al. (2020)
799 used fully connected DNNs for steady state saturated and unsaturated flow. The DNNs were
800 trained to approximate the hydraulic conductivity and spatially varying state variables (head
for saturated flow and pressure for unsaturated flow) from sparse observations. Physical
802 constraints were introduced by adding the residual of the governing equation (Darcy’s
803 Law/Richards equation) to the loss function. The approach was tested on synthetic case
studies and achieved satisfactory accuracy in simulating the head–conductivity relationships.
805 Wang et al. (2020) used a similar approach for transient saturated flow simulation and added
806 the residuals of both the governing partial differential equation (PDE) and boundary
807 conditions to the loss function. The physically constrained DNN yielded a more physically
808 feasible solution and lower generalization error than a DNN without these constraints.
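The sketch below illustrates the general idea of such a physics-constrained loss for a simple one-dimensional, steady, saturated flow problem with constant conductivity; it combines a data misfit on sparse head observations with a penalty on the residual of the governing equation at collocation points. The network architecture, constant conductivity, and weighting are simplifying assumptions, not the formulations of Tartakovsky et al. (2020) or Wang et al. (2020).

```python
import tensorflow as tf

# Hypothetical sketch of a physics-constrained loss for 1-D steady saturated flow:
# the residual of the governing equation is added to the data misfit.
# h_net approximates hydraulic head h(x); K is taken as a known constant for brevity.
h_net = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(50, activation="tanh"),
    tf.keras.layers.Dense(50, activation="tanh"),
    tf.keras.layers.Dense(1),
])
K = tf.constant(1.0)

def pde_residual(x_coll):
    """Residual of d/dx(K dh/dx) = 0 at collocation points x_coll (shape (N, 1))."""
    with tf.GradientTape() as tape2:
        tape2.watch(x_coll)
        with tf.GradientTape() as tape1:
            tape1.watch(x_coll)
            h = h_net(x_coll)
        dh_dx = tape1.gradient(h, x_coll)
        flux = -K * dh_dx
    return tape2.gradient(flux, x_coll)

def physics_constrained_loss(x_obs, h_obs, x_coll, lam=1.0):
    data_misfit = tf.reduce_mean(tf.square(h_net(x_obs) - h_obs))   # fit sparse head observations
    pde_penalty = tf.reduce_mean(tf.square(pde_residual(x_coll)))   # penalize equation violation
    return data_misfit + lam * pde_penalty
```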
809
810 3.1.4. Mining relationships among hydrologic variables for knowledge discovery
830 Physical process-based numerical models have long been the primary quantitative
tools in hydrologic sciences. Here we briefly review the use of machine learning integrated
832 with process-based modeling to facilitate or improve one or more components of the latter
833 (Fig. 7).
834
835 3.2.1. Parameterization
836 Most process-based models require specification of parameters. Often, the parameters
837 do not correspond to directly measurable quantities, or it is infeasible to measure these
838 quantities at the spatial resolution and scale required by the model. In recent years, deep
839 learning in particular has been used to estimate properties of geologic media, such as
840 permeability and diffusivity directly from micro-CT images of porous media (Kamrava et al.,
841 2019; Wu et al., 2018; Wu et al., 2019). For example, Wu et al. (2018) demonstrated the
842 utility of a physics-informed deep network for fast prediction of permeability directly from
images. They first generated images of synthetic porous media and then performed lattice
844 Boltzmann simulations to calculate the permeability of each sample image. This resulted in a
845 dataset that was used to train a modified CNN. The convolutional layers extract latent
846 features from the image that could be relevant to permeability; an MLP then digests the
847 extracted features along with two physical parameters, porosity and specific surface area, to
848 estimate permeability. The physics-informed CNN achieved high test accuracy and
outperformed a regular CNN without the physical parameters. Because fluid dynamics simulations
such as lattice Boltzmann are computationally expensive, the trained deep network can greatly reduce the computational cost of predicting the permeability of a new image.
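The sketch below illustrates, in the Keras functional API, how image-derived convolutional features can be concatenated with physical descriptors (porosity and specific surface area) before a dense head predicts permeability; the image size, layer widths, and output transform are illustrative assumptions rather than the architecture of Wu et al. (2018).

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Hypothetical sketch of a CNN that combines image-derived features with physical
# descriptors (porosity, specific surface area); layer sizes are illustrative only.
img_in = layers.Input(shape=(128, 128, 1), name="binary_image")
x = layers.Conv2D(16, 3, activation="relu")(img_in)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(32, 3, activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)            # latent features extracted from the image

phys_in = layers.Input(shape=(2,), name="porosity_and_ssa")
z = layers.Concatenate()([x, phys_in])            # append the two physical parameters
z = layers.Dense(64, activation="relu")(z)
perm_out = layers.Dense(1, name="log_permeability")(z)

model = Model([img_in, phys_in], perm_out)
model.compile(optimizer="adam", loss="mse")
# Training pairs (image, [porosity, ssa]) -> log-permeability can be generated by
# flow simulations such as lattice Boltzmann on synthetic porous media.
```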
852
GANs and autoencoders have been used in many studies for reconstruction of geologic
854 media, often in order to generate realizations for subsequent stochastic simulations in
855 subsurface hydrology (Laloy et al., 2017; Laloy et al., 2018) and related areas (Mosser et al.,
2018; Liu et al., 2019). Laloy et al. (2017) used a variational autoencoder to construct a low-dimensional latent representation of complex binary geologic media with a relatively small number of parameters, thus making it feasible to perform time-consuming Markov Chain
859 Monte Carlo (MCMC) sampling. The autoencoder outperformed the state-of-the-art inversion
technique based on multiple-point statistics and sequential geostatistical simulation. They noted,
861 however, that the variational autoencoder model requires several tens of thousands of training
862 images. A follow-up study (Laloy et al., 2018) used GANs to replace the variational
863 autoencoder in order to reduce training data needs and extend to multicategorical data
864 (geologic facies).
865
866 In surface hydrology, machine learning has been used for regionalization of rainfall-
867 runoff model parameters, which is an important step towards runoff prediction in ungauged
868 basins (Beck et al., 2016; Jiang et al., 2020). For example, Beck et al. (2016) developed
869 global maps of parameters for a simple conceptual rainfall-runoff model based on climatic
870 and physiographic factors, using a model trained on calibrated parameters from more than
1,700 catchments. A related line of research used streamflow signatures to delineate groups of catchments with distinct hydrological behaviors, employing clustering analysis and decision trees for this purpose (e.g., Sawicz et al., 2013; Toth, 2013; Boscarello et
874 al., 2016). Chaney et al. (2016) used random forest to develop probabilistic estimates of soil
875 properties at 30-m resolution for CONUS based on geospatial environmental covariates such
876 as distribution of uranium, thorium, and potassium.
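A minimal sketch of the regionalization idea is shown below: a random forest is trained to map catchment attributes to a calibrated parameter value so that the parameter can be predicted for ungauged catchments. All inputs are synthetic placeholders; this is not the attribute set or model of Beck et al. (2016).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy sketch of parameter regionalization: catchment attributes (e.g., mean annual
# precipitation, aridity, slope, forest fraction) are mapped to a calibrated model
# parameter so the parameter can be predicted for ungauged catchments.
rng = np.random.default_rng(1)
attrs_gauged = rng.normal(size=(1700, 4))                     # attributes of calibrated catchments
param_gauged = 0.5 + 0.3 * attrs_gauged[:, 0] + rng.normal(scale=0.1, size=1700)

regionalizer = RandomForestRegressor(n_estimators=500, random_state=1)
regionalizer.fit(attrs_gauged, param_gauged)

attrs_ungauged = rng.normal(size=(10, 4))                     # attributes of ungauged catchments
param_ungauged = regionalizer.predict(attrs_ungauged)         # regionalized parameter estimates
```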
877
878 3.2.2. Surrogate modeling
879 Recently, there has been increasing interest in the use of machine learning for
880 surrogate modeling in the context of optimization (Asefa et al., 2005; Cai et al., 2015; Wang
881 et al., 2014; Wu et al., 2015), and uncertainty quantification (Xu et al., 2017; Yang et al.,
882 2018; Zhang et al., 2020). Recent studies have also used deep learning for uncertainty
883 quantification (Hu et al., 2019; Laloy and Jacques, 2019; Mo et al., 2019a; 2019b). Many
884 process-based models, such as groundwater flow and solute transport models, are
885 computationally expensive, making it challenging to perform analyses that require running
the model many times (Asher et al., 2015). Surrogate models emulate process-based model simulation results as a function of inputs and parameters but run much faster. Machine learning techniques are powerful tools for representing nonlinear functions and are thus well
889 positioned for surrogate modeling. For example, Cai et al. (2015) used SVM to develop a fast
890 surrogate of a watershed simulation model (SWAT); the surrogate model was coupled with a
891 stochastic optimization model within a decision-support framework to assess the roles of
892 strategic measures and tactical measures in drought preparedness and mitigation under
893 different climate projections. Wu et al. (2015) used an adaptive approach, where the surrogate
894 model is adaptively refined during the search for optima. Xu et al. (2017) used random forest
895 and SVM to construct fast surrogates of a regional groundwater flow model for Bayesian
896 calibration. Mo et al. (2019a; 2019b) used a convolutional encoder-decoder architecture to
897 build surrogate models to facilitate groundwater contaminant source identification and
uncertainty quantification of a multiphase flow problem, respectively. Laloy and Jacques (2019) compared three surrogate modeling techniques (GPR, polynomial chaos expansion, and DNN) for sensitivity analysis and Bayesian calibration of a reactive transport model. The DNN achieved the best emulation accuracy even though the training set was relatively small (from 75 to 500 samples). However, the DNN surrogate yielded the worst performance for the calibration task and led to a posterior distribution far from the truth. A possible cause is the DNN overfitting the training data, resulting in small but biased prediction errors with a complex structure. In contrast, the GPR-based surrogate approximated the true posterior well. These findings suggest the need for further investigation into quantifying the uncertainty introduced by surrogate modeling. Zhang et al. (2020) proposed an adaptive approach for
908 Bayesian calibration of a groundwater transport model. The approach adaptively refines the
909 surrogate, thus reducing surrogate error, as the posterior distribution is being approximated.
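The following sketch illustrates the basic surrogate-modeling workflow: a Gaussian process emulator is trained on a small number of runs of an expensive simulator (here replaced by an analytic stand-in) and then evaluated many times at negligible cost; the kernel choice and design size are illustrative assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Toy sketch of surrogate modeling: a cheap-to-evaluate GPR emulator is trained on a
# small number of runs of an expensive simulator (here a stand-in analytic function),
# then used in place of the simulator for many evaluations (e.g., in MCMC or optimization).
def expensive_simulator(theta):
    # placeholder for a process-based model run; in practice this may take minutes to hours
    return np.sin(3.0 * theta[:, 0]) + 0.5 * theta[:, 1] ** 2

rng = np.random.default_rng(2)
theta_train = rng.uniform(-1, 1, size=(100, 2))        # design points in parameter space
y_train = expensive_simulator(theta_train)             # 100 "expensive" model runs

surrogate = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
surrogate.fit(theta_train, y_train)

theta_new = rng.uniform(-1, 1, size=(100000, 2))       # many cheap surrogate evaluations
y_pred, y_std = surrogate.predict(theta_new, return_std=True)
```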
910
911 3.2.3. Bias correction
912 Process-based models are generally considered more reliable than machine learning-
913 based data-driven models for predictive tasks such as projection under climate change.
However, it has been recognized that, despite advances in the understanding of hydrologic processes and the development of sophisticated model structures, process-based models may yield biased simulation results due to biases in forcing data, incorrect parameters, and/or simplified or improper conceptualization of the physical processes. Machine learning techniques
918 may be able to learn from observational data to recover information not represented by
919 process-based models. Because process-based and data-driven modeling have complementary
920 strengths, they can be combined to yield more accurate predictions. Conventional machine
921 learning techniques have proven effective in correcting the bias of surface (Abebe and Price,
922 2003; Solomatine and Shrestha, 2009; Pianosi et al., 2012; Evin et al., 2014 and references
923 therein) and subsurface hydrologic models (Demissie et al., 2009; Xu et al., 2014; Tyralis et
al., 2019). Recently, emerging research has applied deep learning to bias correction. Sun et al. (2019) used a CNN to correct the mismatch between NOAH-simulated terrestrial water storage anomaly (TWSA) and GRACE products. Nearing et al. (2020) used an LSTM to post-process the output of a calibrated conceptual rainfall-runoff model and achieved better
accuracy than either model alone. Frame et al. (2020) applied a similar approach
929 to post-process the daily streamflow predictions of the National Water Model (NWM),
930 leading to substantial improvements. The LSTM performance increased when NWM states
931 and fluxes were added as inputs.
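A minimal sketch of this post-processing idea is given below: a gradient boosting model takes the raw simulation of a process-based model together with forcings as inputs and is trained against observations; the synthetic data and predictor set are placeholders and do not reproduce the configurations of Nearing et al. (2020) or Frame et al. (2020).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Toy sketch of data-driven bias correction / post-processing: the raw simulation from a
# process-based model, together with forcings, is used to predict observed streamflow.
# All arrays below are synthetic placeholders.
rng = np.random.default_rng(3)
n_days = 3000
precip = rng.gamma(shape=0.8, scale=5.0, size=n_days)
temp = 10 + 8 * np.sin(2 * np.pi * np.arange(n_days) / 365.25)
q_sim = 0.4 * precip + 0.1 * np.maximum(temp, 0)              # biased process-model simulation
q_obs = 0.5 * precip + 0.05 * np.maximum(temp, 0) + rng.normal(scale=0.2, size=n_days)

X = np.column_stack([q_sim, precip, temp])                    # simulation + forcings as predictors
post_processor = GradientBoostingRegressor().fit(X[:2000], q_obs[:2000])
q_corrected = post_processor.predict(X[2000:])                # bias-corrected streamflow
```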
932
933 4. CHALLENGES AND OPPORTUNITIES
Possible degradation of generalization performance and the lack of physical interpretability and constraints have in the past largely hindered the application of machine learning in hydrology and other geoscience disciplines. Even with regularization strategies implemented,
937 a trained machine learning model may still generalize poorly. This issue is exacerbated by the
relatively small training datasets available in hydrologic applications as well as the need to
939 predict under nonstationary conditions such as those induced by climate change. Machine
940 learning may also fall short of predicting emerging patterns. In addition to temporal
941 nonstationarity, hydrologic applications are known to exhibit high degrees of spatial
942 heterogeneity. Most previous applications of machine learning in hydrology are limited to
943 one or a few test cases, and the machine learning models developed for a limited number of
944 sites are likely not transferable to other regions where training data is scarce. Although the
945 extrapolation problem exists even for process-based models, it is particularly acute for
machine learning methods, partly because of their flexibility in adapting to a wide range of functional relationships and their lack of physical constraints.
948
949 A second major challenge lies in the lack of physical interpretability of machine
950 learning models. With few exceptions (e.g., Lasso, CART), most machine learning models
learn functional relationships that are too complicated to comprehend. It is usually difficult,
952 if at all possible, to draw physical understanding from the learned model. In addition to the
953 models themselves being hard to interpret, they may provide predictions that cannot be easily
954 understood, are implausible, and/or lack physical consistency. The lack of transparency raises
questions about the appropriateness of using machine learning models for high-stakes decision making.
957
Because of this, and given the importance of knowledge discovery in any physical science discipline, it is crucial to develop approaches for probing these models as well as inherently interpretable machine learning models. In recent years, there has been a
961 surge of work on the topic of “explainable AI” within the deep learning community (see
Gilpin et al., 2018; Rudin et al., 2019; Samek and Müller, 2019, and references therein). In the
963 hydrology community, interpreting deep learning models is also gaining attention (Shen,
964 2018; Ding et al. 2019; Kratzert et al., 2019a).
965
966 A current research frontier is to integrate knowledge about physical processes with
967 machine learning. Process-based modeling and data-driven modeling have complementary
strengths and weaknesses, and combining them in multiple ways provides exciting
969 opportunities to address the above-mentioned challenges. Karpatne et al. (2017) and
970 Reichstein et al. (2019) provide comprehensive recommendations on possible ways physical
971 knowledge and machine learning can be integrated. Here, we highlight a few integration
972 mechanisms that have proven to be promising in hydrologic applications. First, physical
973 knowledge can be incorporated as regularization terms in the loss function. In this way, the
974 learned model is forced to respect physical constraints such as mass and energy conservation
(de Bezenac et al., 2019; Jia et al., 2019; Tartakovsky et al., 2020; Zhao et al., 2019; Wang et al.,
976 2020). Second, a hybrid model can consist of a process-based component responsible for
977 physical processes that are well understood and a machine learning component dealing with
978 the less understood processes (Ren et al., 2018; Sun et al., 2019). In some cases, it may be
979 possible to encode the physical knowledge expressed as ordinary or partial differential
980 equations into the deep learning architecture (Jiang et al., 2020). When explicit encoding is
981 not possible, an alternative is to augment training data of the machine learning model with
simulation results generated by a process-based model (Jia et al., 2019). This provides twofold benefits: more training data and the potential to learn, from the augmented training data, physical knowledge relevant to predicting under nonstationary conditions. It has
985 been shown in some studies discussed above and reviewed in Section 3 that incorporating
986 physical knowledge improves the generalization performance of the machine learning model.
987
988 A third challenge arises from the limited training data in hydrologic applications.
989 Despite the fast-growing hydrologic data availability, data are still scarce in some
990 applications, especially when data are expensive or time-consuming to collect. For example,
there may be a limited amount of ground truth for the output variable (e.g., Deines et al., 2017;
992 Xu et al., 2019), or available training data may have imbalanced classes due to sampling bias
993 or the output variable of interest being a low probability event. In addition, information does
994 not necessarily increase linearly with data amount. For example, one year of streamflow
995 observations at 15-min interval (~35,040 data points) is likely insufficient to properly train a
996 machine learning model for rainfall-runoff modeling due to autocorrelation and the limited
997 range of the hydrologic regime the training data covers. The importance of the
998 “informativeness” of the data (Gupta et al., 1998) has been investigated in various studies
999 both theoretically (Gupta and Sorooshian, 1985) and empirically (Yapo et al., 1996;
1000 Boughton, 2007; Singh and Bárdossy, 2012). These studies provide valuable insights into
determining the amount of data needed to train machine learning models in a hydrologic context. Ayzel and Heistermann (2021) trained deep learning-based rainfall-runoff models for six CAMELS watersheds using varying data lengths and found that deep learning models require longer records for calibration than a conceptual hydrologic model, although their performance catches up quickly with increasing record length. Their findings suggest that, in practice, training deep learning architectures may require less data than predicted by theoretical bounds on sample size established in the deep learning literature (e.g., Du et al., 2018).
1008 Problems associated with limited sample size may be alleviated by the above-mentioned data
1009 augmentation strategy and borrowing ideas from unsupervised learning, semi-supervised
1010 learning (Zhu and Goldberg, 2009; Kingma et al., 2014; Ding et al., 2018) or active learning
1011 (Settles, 2011) to utilize available data more efficiently (Racah et al., 2017; Karpatne et al.,
1012 2019).
1013
1014 Related to the problem of small sample size is the juxtaposition of multi-source,
1015 multi-type, multi-scale data with various accuracy. Machine learning algorithms do not have
a mechanism to explicitly account for such data heterogeneity. This is understandable given the homogeneity of the data involved in typical machine learning and deep learning applications
1018 (e.g., a dataset of images or sentences). In contrast, hydrologic applications often encounter
variables with different physical meanings, data representative of various scales (e.g., point-based ground stations and satellite imagery at different resolutions and sampling frequencies), and
1021 noisy observations. In addition, measurements may contain bias and complex error structure
1022 that violate the commonly used white noise assumption. When these data are used as inputs
1023 and training targets, the data heterogeneity will likely affect the learning outcome. Therefore,
1024 there is a need to develop methods to handle data heterogeneity and uncertainty via, for
1025 example, carefully designed experiments (such as artificial perturbation to training data) or an
1026 explicit probabilistic formulation of machine learning algorithms (Ghahramani, 2015; Gal
and Ghahramani, 2016). Appropriately representing and propagating uncertainty is also crucial for the robustness of predictions provided by machine learning models, particularly
1029 when they are trained with limited data and/or used under nonstationary conditions.
1030
1031 Despite the reported successes, most of the studies reviewed in Section 3 are isolated
applications of machine learning to a specific problem. Often, deep learning
1033 architectures that have been tested and proven successful within the deep learning community
1034 need some tailoring before they can be applied to hydrologic problems. This is because a
1035 hydrologic application may not be directly mapped to a classical deep learning task for which
these architectures have been established. For example, LSTMs have achieved great success in translating sentences from one language to another. A sentence differs from the time
1038 series of a hydrologic variable, and this difference affects the design of the deep learning
1039 architecture as well as data storage management practices. Often, identifying the appropriate
architecture for a specific application requires substantial trial-and-error effort, which may lead to a suboptimal choice. This difficulty partially counteracts the benefit deep learning
1042 offers in terms of avoiding feature engineering required by conventional machine learning
1043 methods. Bridging this disciplinary gap calls for formulation of hydrologic problems as
“standard” machine learning tasks furnished with curated benchmark datasets.
1045
1046 5. CONCLUDING REMARKS
1047 The recently revived interest within the hydrology community in machine learning in
1048 general and deep learning in particular is likely to continue given the hydrologic data deluge.
1049 The enormous amount of data poses challenges to traditional knowledge-driven reasoning
1050 and provides exciting opportunities for machine learning-based data-driven reasoning. In this
1051 overview, we attempted to provide a comprehensive, although far from complete, discussion
of recent success stories of applying machine learning as a stand-alone model or as a complement to process-based modeling efforts. Existing studies demonstrate the potential of machine learning techniques for filling gaps in physical process understanding,
1055 generating more accurate predictions, enabling inverse modeling, and discovering new
1056 knowledge. Several primary challenges are identified in using machine learning for
1057 prediction under nonstationary conditions, developing interpretable machine learning models,
1058 ensuring physical consistency, training with limited sample size, and characterizing and
1059 propagating uncertainty. Meanwhile, there is emerging research that aims at integrating
1060 physical knowledge with machine learning to address some of the above challenges.
1061
1062
1063 We argue that there is a need to develop formulations of representative hydrologic
1064 problems with quality-controlled benchmark datasets. These formulations can be related to
1065 one or more standard machine learning tasks that have been extensively studied, so that the
1066 advances in the machine learning and other fields can be leveraged to identify the best
1067 strategy to tackle the hydrologic problem. For example, forecasting of a hydrologic variable
1068 may be formulated as the problem of estimating the expected value (deterministic) or
probability density function (probabilistic) of the variable over the next 𝑘 time steps
1070 conditioned on historical measurements of itself and explanatory variables as well as
1071 forecasts of the explanatory variables. Depending on how the variables are resolved spatially,
1072 each variable can be gridded or time series data. Such formulations will facilitate
1073 development of general-purpose architectures suitable for representative types of hydrologic
applications as well as identification of similar problem formulations in other fields of
1075 geosciences. Data from isolated applications that fall within the same problem formulation
1076 can be compiled and quality controlled to create benchmark datasets that are much larger than
1077 data used in a single application. The benchmark datasets will serve as a venue for
1078 assessment and intercomparison of various machine learning models in terms of prediction
1079 capability, physical feasibility, and interpretability. Achieving this requires collective efforts
1080 within the hydrology community as well as interdisciplinary collaboration with the machine
1081 learning and geosciences communities.
1082
1083 Data Availability Statement
1084 Data sharing is not applicable to this article as no new data were created or analyzed in this
1085 study.
1086
1087 Funding Information
1088 T. Xu was supported by NOAA COM Grant NA20OAR4310341 and NSF Grant OAC-
1089 1931297 as well as funding provided by the School of Sustainable Engineering and the Built
1090 Environment, Ira A. Fulton Schools of Engineering, Arizona State University.
1091
1092 Acknowledgments
1093 The authors thank Dr. Ruijie Zeng (Arizona State University) for comments on an earlier
1094 version of this manuscript and Qianqiu Longyang and Ruoyao Ou for their contributions to
the visualizations. The authors declare no conflict of interest.
1096
1097 References
1098 Abebe, A., and R. Price. (2003). Managing uncertainty in hydrological models using complementary
1099 models, Hydrol. Sci. J., 48(5), 679–692.
1100 Aboutalebi, M., Allen, L. N., Torres-Rua, A. F., McKee, M., & Coopmans, C. (2019). Estimation of soil
1101 moisture at different soil levels using machine learning techniques and unmanned aerial vehicle
1102 (UAV) multispectral imagery. In Autonomous Air and Ground Sensing Systems for Agricultural
1103 Optimization and Phenotyping IV (Vol. 11008, p. 110080S). International Society for Optics and
1104 Photonics.
1105 Adnan, R. M., Liang, Z., Heddam, S., Zounemat-Kermani, M., Kisi, O., & Li, B. (2020). Least square
1106 support vector machine and multivariate adaptive regression splines for streamflow prediction in
1107 mountainous basin using hydro-meteorological data as inputs. Journal of Hydrology, 586, 124371.
1108 Ahmad, S., Kalra, A., & Stephen, H. (2010). Estimating soil moisture using remote sensing data: A
1109 machine learning approach. Advances in Water Resources, 33(1), 69–80.
1110 https://doi.org/10.1016/J.ADVWATRES.2009.10.008
1111 Akbari Asanjan, A., Yang, T., Hsu, K., Sorooshian, S., Lin, J., & Peng, Q. (2018). Short‐term
1112 precipitation forecast based on the PERSIANN system and LSTM recurrent neural networks.
1113 Journal of Geophysical Research: Atmospheres, 123(22), 12-543.
1114 Anda, A., Simon, B., Soós, G., Menyhárt, L., da Silva, J. A. T., & Kucserka, T. (2018). Extending
1115 Class A pan evaporation for a shallow lake to simulate the impact of littoral sediment and
1116 submerged macrophytes: a case study for Keszthely Bay (Lake Balaton, Hungary). Agricultural
1117 and forest meteorology, 250, 277-289.
1118 Andugula, P., Durbha, S. S., Lokhande, A., & Suradhaniwar, S. (2017). Gaussian process based
1119 spatial modeling of soil moisture for dense soil moisture sensing network. In 2017 6th
1120 International Conference on Agro-Geoinformatics (pp. 1-5). IEEE.
1121 Asefa, T., Kemblowski, M., McKee, M., & Khalil, A. (2006). Multi-time scale stream flow predictions:
1122 The support vector machines approach. Journal of Hydrology, 318(1–4), 7–16.
1123 https://doi.org/10.1016/J.JHYDROL.2005.06.001
1124 Asefa, T., Kemblowski, M., Urroz, G., McKee, M., 2005. Support vector machines (SVMs) for
1125 monitoring network design. Ground Water 43 (3), 413–422.
1126 Asher, M. J., Croke, B. F., Jakeman, A. J., & Peeters, L. J. (2015). A review of surrogate models and
1127 their application to groundwater modeling. Water Resources Research, 51(8), 5957-5973.
1128 Ashouri, H., Hsu, K. L., Sorooshian, S., Braithwaite, D. K., Knapp, K. R., Cecil, L. D., ... & Prat, O. P.
1129 (2015). PERSIANN-CDR: Daily precipitation climate data record from multisatellite observations
1130 for hydrological and climate studies. Bulletin of the American Meteorological Society, 96(1), 69-
1131 83.
1132 Ayzel, G., & Heistermann, M. (2021). The effect of calibration data length on the performance of a
1133 conceptual hydrological model versus LSTM and GRU: A case study for six basins from the
1134 CAMELS dataset. Computers & Geosciences, 104708.
1135 Bair, E. H., Abreu Calfa, A., Rittger, K., & Dozier, J. (2018). Using machine learning for real-time
1136 estimates of snow water equivalent in the watersheds of Afghanistan. The Cryosphere, 12(5),
1137 1579-1594.
1138 Beck, H. E., van Dijk, A. I., De Roo, A., Miralles, D. G., McVicar, T. R., Schellekens, J., & Bruijnzeel,
1139 L. A. (2016). Global‐scale regionalization of hydrologic model parameters. Water Resources
1140 Research, 52(5), 3599-3622.
1141 Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and new
1142 perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8), 1798-1828.
1143 Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layer-wise training of deep
1144 networks. In Advances in neural information processing systems (pp. 153-160).
1145 Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent
1146 is difficult. IEEE transactions on neural networks, 5(2), 157-166.
1147 Blanchet, F. G., Legendre, P., & Borcard, D. (2008). Forward selection of explanatory
1148 variables. Ecology, 89(9), 2623-2632.
1149 Boscarello, L., Ravazzani, G., Cislaghi, A., & Mancini, M. (2016). Regionalization of flow-duration
1150 curves through catchment classification with streamflow signatures and physiographic–climate
1151 indices. Journal of Hydrologic Engineering, 21(3), 05015027.
1152 Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. In Proceedings of
1153 COMPSTAT'2010 (pp. 177-186). Physica-Verlag HD.
1154 Boucher, M.-A., Quilty, J., & Adamowski, J. (2020). Data assimilation for streamflow forecasting using
1155 extreme learning machines and multilayer perceptrons. Water Resources Research, 56(6),
1156 e2019WR026226.
1157 Boughton, W. C. (2007). Effect of data length on rainfall–runoff modeling. Environmental Modeling &
1158 Software, 22(3), 406-413.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
1160 Brooks, W., Corsi, S., Fienen, M., & Carvin, R. (2016). Predicting recreational water quality advisories:
1161 A comparison of statistical methods. Environmental Modeling & Software, 76, 81-94.
1162 Broxton, P. D., Van Leeuwen, W. J., & Biederman, J. A. (2019). Improving snow water equivalent
1163 maps with machine learning of snow survey and lidar measurements. Water Resources
1164 Research, 55(5), 3739-3757.
1165 Buch, A. M., Mazumdar, H. S., & Pandey, P. C. (1993). Application of artificial neural networks in
1166 hydrological modeling: a case study of runoff simulation of a Himalayan glacier basin. In
1167 Proceedings of 1993 International Conference on Neural Networks (IJCNN-93-Nagoya, Japan)
1168 (Vol. 1, pp. 971-974). IEEE.
1169 Burnham, K. P., & Anderson, D. R. (2004). Multimodel inference: understanding AIC and BIC in
1170 model selection. Sociological methods & research, 33(2), 261-304.
1171 Cai, X., Zeng, R., Kang, W. H., Song, J., & Valocchi, A. J. (2015). Strategic planning for drought
1172 mitigation under climate change. Journal of Water Resources Planning and Management, 141(9),
1173 04015004.
1174 Chaney, N. W., Wood, E. F., McBratney, A. B., Hempel, J. W., Nauman, T. W., Brungard, C. W., &
1175 Odgers, N. P. (2016). POLARIS: A 30-meter probabilistic soil series map of the contiguous United
1176 States. Geoderma, 274, 54-67.
1177 Chaudhari, S., Mithal, V., Polatkan, G., & Ramanath, R. (2020). An Attentive Survey of Attention
1178 Models. J. ACM, 37, 4 (111).
1179 Chen, H., Chandrasekar, V., Tan, H., & Cifelli, R. (2019). Rainfall estimation from ground radar and
1180 TRMM precipitation radar using hybrid deep neural networks. Geophysical Research Letters,
1181 46(17-18), 10669-10678.
1182 Chen, T., & Guestrin, C. (2016). Xgboost: A scalable tree boosting system. In Proceedings of the
1183 22nd acm sigkdd international conference on knowledge discovery and data mining (pp. 785-
1184 794).
1185 Chen, W., Tsangaratos, P., Ilia, I., Duan, Z., & Chen, X. (2019). Groundwater spring potential
1186 mapping using population-based evolutionary algorithms and data mining methods. Science of
1187 The Total Environment, 684, 31-49.
1188 Ching, T., Himmelstein, D. S., Beaulieu-Jones, B. K., Kalinin, A. A., Do, B. T., Way, G. P., ... &
1189 Greene, C. S. (2018). Opportunities and obstacles for deep learning in biology and
1190 medicine. Journal of The Royal Society Interface, 15(141), 20170387.
1191 Cho, E., Jacobs, J. M., Jia, X., & Kraatz, S. (2019). Identifying Subsurface Drainage using Satellite
1192 Big Data and Machine Learning via Google Earth Engine. Water Resources Research, 55(10),
1193 8028-8045.
1194 Ciregan, D., Meier, U., & Schmidhuber, J. (2012). Multi-column deep neural networks for image
1195 classification. In 2012 IEEE conference on computer vision and pattern recognition (pp. 3642-
1196 3649). IEEE.
1197 Coates, A., Ng, A., & Lee, H. (2011). An analysis of single-layer networks in unsupervised feature
1198 learning. In Proceedings of the fourteenth international conference on artificial intelligence and
1199 statistics (pp. 215-223).
1200 Creswell, A., White, T., Dumoulin, V., Arulkumaran, K., Sengupta, B., & Bharath, A. A. (2018).
1201 Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 35(1), 53-65.
1204 de Bezenac, E., Pajot, A., & Gallinari, P. (2019). Deep learning for physical processes: Incorporating
1205 prior scientific knowledge. Journal of Statistical Mechanics: Theory and Experiment, 2019(12),
1206 124009.
1207 de Oliveira, J. V., & Pedrycz, W. (Eds.). (2007). Advances in fuzzy clustering and its applications.
1208 John Wiley & Sons.
1209 Deines, J. M., Kendall, A. D., & Hyndman, D. W. (2017). Annual irrigation dynamics in the US
1210 Northern High Plains derived from Landsat satellite data. Geophysical Research Letters, 44(18),
1211 9350-9360.
1212 Demissie, Y. K., Valocchi, A. J., Minsker, B. S., & Bailey, B. A. (2009). Integrating a calibrated
1213 groundwater flow model with error-correcting data-driven models to improve predictions. Journal
1214 of hydrology, 364(3-4), 257-271.
1215 Ding, Y., Wang, L., Fan, D., & Gong, B. (2018, March). A semi-supervised two-stage approach to
1216 learning from noisy labels. In 2018 IEEE Winter Conference on Applications of Computer Vision
1217 (WACV) (pp. 1215-1224). IEEE.
1218 Ding, Y., Zhu, Y., Wu, Y., Jun, F., & Cheng, Z. (2019). Spatio-Temporal Attention LSTM Model for
1219 Flood Forecasting. In 2019 International Conference on Internet of Things (iThings) and IEEE
1220 Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social
1221 Computing (CPSCom) and IEEE Smart Data (SmartData) (pp. 458-465). IEEE.
1222 Du, S. S., Wang, Y., Zhai, X., Balakrishnan, S., Salakhutdinov, R., & Singh, A. (2018). How many
1223 samples are needed to estimate a convolutional or recurrent neural network?. arXiv preprint
1224 arXiv:1805.07883.
1225 Elkahky, A. M., Song, Y., & He, X. (2015). A multi-view deep learning approach for cross domain user
1226 modeling in recommendation systems. In Proceedings of the 24th International Conference on
1227 World Wide Web (pp. 278-288).
1228 Evin, G., Thyer, M., Kavetski, D., McInerney, D., & Kuczera, G. (2014). Comparison of joint versus
1229 postprocessor approaches for hydrological uncertainty estimation accounting for error
1230 autocorrelation and heteroscedasticity. Water Resources Research, 50(3), 2350-2375.
1231 doi:10.1002/ 2013WR014185
1232 Fang, K., Shen, C., Kifer, D., & Yang, X. (2017). Prolongation of SMAP to spatio-temporally seamless
1233 coverage of continental US using a deep learning neural network. Geophysical Research Letters,
1234 44, 11,030–11,039. https://doi.org/10.1002/2017GL075619
1235 Fang, K., & Shen, C. (2020). Near-real-time forecast of satellite-based soil moisture using long short-
1236 term memory with an adaptive data integration kernel. Journal of Hydrometeorology, 21(3), 399-
1237 413.
1238 Fernández-Delgado, M., Cernadas, E., Barro, S., & Amorim, D. (2014). Do we need hundreds of
1239 classifiers to solve real world classification problems?. The journal of machine learning
1240 research, 15(1), 3133-3181.
1241 Fernández-Delgado, M., Sirsat, M. S., Cernadas, E., Alawadi, S., Barro, S., & Febrero-Bande, M.
1242 (2019). An extensive experimental survey of regression methods. Neural Networks, 111, 11-34.
1243 Frame, J., Nearing, G., Kratzert, F., & Rahman, M. (2020). Post processing the US National Water
1244 Model with a Long Short-Term Memory network. https://doi.org/10.31223/osf.io/4xhac
1245 Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of
1246 statistics, 1189-1232.
1247 Gal, Y., & Ghahramani, Z. (2016). Dropout as a bayesian approximation: Representing model
1248 uncertainty in deep learning. In international conference on machine learning (pp. 1050-1059).
1249 Gao, Q., Zribi, M., Escorihuela, M. J., Baghdadi, N., & Segui, P. Q. (2018). Irrigation mapping using
1250 Sentinel-1 time series at field scale. Remote Sensing, 10(9), 1495.
1251 George, E. I. (2000). The variable selection problem. Journal of the American Statistical
1252 Association, 95(452), 1304-1308.
1253 Géron, A. (2019). Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts,
1254 tools, and techniques to build intelligent systems. O'Reilly Media.
1255 Ghahramani, Z. (2015). Probabilistic machine learning and artificial intelligence. Nature, 521(7553),
1256 452-459.
1257 Ghose, D., Das, U., & Roy, P. (2018). Modeling response of runoff and evapotranspiration for
1258 predicting water table depth in arid region using dynamic recurrent neural network. Groundwater
1259 for Sustainable Development, 6, 263-269.
1260 Gilpin, L. H., Bau, D., Yuan, B. Z., Bajwa, A., Specter, M., & Kagal, L. (2018). Explaining explanations:
1261 An overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on
1262 data science and advanced analytics (DSAA) (pp. 80-89). IEEE.
1263 Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural
1264 networks. In Proceedings of the thirteenth international conference on artificial intelligence and
1265 statistics (pp. 249-256).
1266 Glorot, X., Bordes, A. & Bengio. Y. (2011). Deep sparse rectifier neural networks. In Proceedings of
1267 the fourteenth international conference on artificial intelligence and statistics (pp. 315-323).
1268 Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... & Bengio, Y.
1269 (2014). Generative adversarial nets. In Advances in neural information processing systems (pp.
1270 2672-2680).
1271 Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT press.
1272 Goodwell, A. E., & Kumar, P. (2017). Temporal information partitioning: Characterizing synergy,
1273 uniqueness, and redundancy in interacting environmental variables. Water Resources Research,
1274 53(7), 5920-5942.
1275 Gorelick, N., Hancher, M., Dixon, M., Ilyushchenko, S., Thau, D., & Moore, R. (2017). Google Earth
1276 Engine: Planetary-scale geospatial analysis for everyone. Remote sensing of Environment, 202,
1277 18-27.
1278 Graves, A. (2012). Long short-term memory. In Supervised sequence labelling with recurrent neural
1279 networks (pp. 37-45). Springer, Berlin, Heidelberg.
1280 Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R., & Schmidhuber, J. (2017). LSTM: A
1281 search space odyssey. IEEE transactions on neural networks and learning systems, 28(10),
1282 2222-2232.
1283 Gupta, V. K., & Sorooshian, S. (1985). The relationship between data and the precision of parameter
1284 estimates of hydrologic models. Journal of Hydrology, 81(1-2), 57-77.
1285 Gupta, H. V., Sorooshian, S., & Yapo, P. O. (1998). Toward improved calibration of hydrologic
1286 models: Multiple and noncommensurable measures of information. Water Resources
1287 Research, 34(4), 751-763.
1288 Gusyev, M. A., Haitjema, H. M., Carlson, C. P., & Gonzalez, M. A. (2013). Use of nested flow models
1289 and interpolation techniques for science‐based management of the Sheyenne National
1290 Grassland, North Dakota, USA. Groundwater, 51(3), 414-420.
1291 Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of machine
1292 learning research, 3(Mar), 1157-1182.
1293 Guzman, S. M., Paz, J. O., Tagert, M. L. M., & Mercer, A. E. (2019). Evaluation of seasonally
1294 classified inputs for the prediction of daily groundwater levels: NARX networks vs support vector
1295 machines. Environmental Modeling & Assessment, 24(2), 223-234.
1296 Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A k-means clustering algorithm. Journal of
1297 the royal statistical society. series c (applied statistics), 28(1), 100-108.
1298 Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: data mining,
1299 inference, and prediction, 2nd Ed. Springer Science & Business Media.
1300 Hino, M., Benami, E., & Brooks, N. (2018). Machine learning for environmental monitoring. Nature
1301 Sustainability, 1(10), 583-588.
1302 Hipsey, M. R., Hamilton, D. P., Hanson, P. C., Carey, C. C., Coletti, J. Z., Read, J. S., ... & Brookes, J.
1303 D. (2015). Predicting the resilience and recovery of aquatic systems: A framework for model
1304 evolution within environmental observatories. Water Resources Research, 51(9), 7023-7043.
1305 Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-
1306 1780.
1307 Hochreiter, S. (1998). The vanishing gradient problem during learning recurrent neural nets and
1308 problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based
1309 Systems, 6(02), 107-116.
1310 Hofmann, T., Schölkopf, B., & Smola, A. J. (2008). Kernel methods in machine learning. The annals
1311 of statistics, 1171-1220.
1312 Hosmer Jr, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied logistic regression (Vol. 398).
1313 John Wiley & Sons.
1314 Hsu, K. L., Gupta, H. V., & Sorooshian, S. (1995). Artificial neural network modeling of the rainfall‐
1315 runoff process. Water resources research, 31(10), 2517-2530.
1316 Hu, R., Fang, F., Pain, C. C., & Navon, I. M. (2019). Rapid spatio-temporal flood prediction and
1317 uncertainty quantification using a deep learning method. Journal of Hydrology, 575, 911-920.
1318 Hu, Y., Quinn, C. J., Cai, X., & Garfinkle, N. W. (2017). Combining human and machine intelligence to
1319 derive agents’ behavioral rules for groundwater irrigation. Advances in water resources, 109, 29-
1320 40.
1321 Hutter, F., Kotthoff, L., & Vanschoren, J. (2019). Automated machine learning: methods, systems,
1322 challenges (p. 219). Springer Nature.
1323 Irani, J., Pise, N., & Phatak, M. (2016). Clustering techniques and the similarity measures used in
1324 clustering: a survey. International journal of computer applications, 134(7), 9-14.
1325 Izenman, A. J. (2013). Linear discriminant analysis. In Modern multivariate statistical techniques (pp.
1326 237-280). Springer, New York, NY.
1327 Jia, X., Willard, J., Karpatne, A., Read, J., Zwart, J., Steinbach, M., & Kumar, V. (2019). Physics
1328 guided RNNs for modeling dynamical systems: A case study in simulating lake temperature
1329 profiles. In Proceedings of the 2019 SIAM International Conference on Data Mining (pp. 558-566).
1330 Society for Industrial and Applied Mathematics.
1331 Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern recognition letters, 31(8), 651-
1332 666.
1333 Jiang, S., Zheng, Y., & Solomatine, D. (2020). Improving AI system awareness of geoscience
1334 knowledge: Symbiotic integration of physical approaches and deep learning. Geophysical
1335 Research Letters, 47(13), e2020GL088229.
1336 Khan, S., & Yairi, T. (2018). A review on the application of deep learning in system health
1337 management. Mechanical Systems and Signal Processing, 107, 241-265.
1338 Kamrava, S., Tahmasebi, P., & Sahimi, M. (2020). Linking morphology of porous media to their
1339 macroscopic permeability by deep learning. Transport in Porous Media, 131(2), 427-448.
1340 Kang, K. W., Park, C. Y., & Kim, J. H. (1993). Neural network and its application to rainfall-runoff
1341 forecasting. Korean Journal of Hydrosciences, 4, 1-9.
1342 Kaplan, A., & Haenlein, M. (2019). Siri, Siri, in my hand: Who’s the fairest in the land? On the
1343 interpretations, illustrations, and implications of artificial intelligence. Business Horizons, 62(1),
1344 15-25.
1345 Karpatne, A., Ebert-Uphoff, I., Ravela, S., Babaie, H. A., & Kumar, V. (2019). Machine learning for the
1346 geosciences: Challenges and opportunities. IEEE Transactions on Knowledge and Data
1347 Engineering, 31(8), 1544-1554.
1348 Ke, Y., Im, J., Park, S., & Gong, H. (2016). Downscaling of MODIS One kilometer evapotranspiration
1349 using Landsat-8 data and machine learning approaches. Remote Sensing, 8(3), 215.
1350 Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., ... & Liu, T. Y. (2017). Lightgbm: A highly
1351 efficient gradient boosting decision tree. In Advances in neural information processing systems
1352 (pp. 3146-3154).
1353 Kim, S., Kim, H., Lee, J., Yoon, S., Kahou, S. E., Kashinath, K., & Prabhat, M. (2019, January). Deep-
1354 hurricane-tracker: Tracking and forecasting extreme climate events. In 2019 IEEE Winter
1355 Conference on Applications of Computer Vision (WACV) (pp. 1761-1769). IEEE.
1356 Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In International
1357 Conference on Learning Representations (ICLR).
1358 Kingma, D. P., Mohamed, S., Rezende, D. J., & Welling, M. (2014). Semi-supervised learning with
1359 deep generative models. In Advances in neural information processing systems (pp. 3581-3589).
1360 Kingma, D. P., & Welling, M. (2014). Auto-encoding variational bayes. In International Conference on
1361 Learning Representations (ICLR).
1362 Kleiber, W., Katz, R. W., and Rajagopalan, B. (2012), Daily spatiotemporal precipitation simulation
1363 using latent and transformed Gaussian processes, Water Resour. Res., 48, W01523,
1364 doi:10.1029/2011WR011105.
1365 Kordestani, M. D., Naghibi, S. A., Hashemi, H., Ahmadi, K., Kalantar, B., & Pradhan, B. (2019).
1366 Groundwater potential mapping using a novel data-mining ensemble model. Hydrogeol. J, 27(1),
1367 211-224. https://doi.org/10.1007/s10040-018-1848-5.
1368 Kratzert, F., Klotz, D., Brenner, C., Schulz, K., & Herrnegger, M. (2018). Rainfall–runoff modeling
1369 using long short-term memory (LSTM) networks. Hydrology and Earth System Sciences, 22(11),
1370 6005-6022.
1371 Kratzert, F., Herrnegger, M., Klotz, D., Hochreiter, S., & Klambauer, G. (2019a). NeuralHydrology–
1372 Interpreting LSTMs in Hydrology. In Explainable AI: Interpreting, Explaining and Visualizing Deep
1373 Learning (pp. 347-362). Springer, Cham.
1374 Kratzert, F., Klotz, D., Shalev, G., Klambauer, G., Hochreiter, S., & Nearing, G. (2019b). Towards
1375 learning universal, regional, and local hydrological behaviors via machine learning applied to
1376 large-sample datasets. Hydrology & Earth System Sciences, 23(12).
1377 Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional
1378 neural networks. In Advances in neural information processing systems (pp. 1097-1105).
1379 Kumar, P. (2015). Hydrocomplexity: Addressing water security and emergent environmental risks.
1380 Water Resources Research, 51(7), 5827-5838.
1381 Laloy, E., Hérault, R., Lee, J., Jacques, D., & Linde, N. (2017). Inversion using a new low-dimensional
1382 representation of complex binary geological media based on a deep neural network. Advances in
1383 Water Resources, 110, 387-405.
1384 Laloy, E., Hérault, R., Jacques, D., & Linde, N. (2018). Training‐image based geostatistical inversion
1385 using a spatial generative adversarial neural network. Water Resources Research, 54(1), 381-
1386 406.
1387 Laloy, E., & Jacques, D. (2019). Emulation of CPU-demanding reactive transport models: a
1388 comparison of Gaussian processes, polynomial chaos expansion, and deep neural
1389 networks. Computational Geosciences, 23(5), 1193-1215.
1390 Laloy, E., Linde, N., Ruffino, C., Hérault, R., Gasso, G., & Jacques, D. (2019). Gradient-based
1391 deterministic inversion of geophysical data with generative adversarial networks: Is it feasible?.
1392 Computers & Geosciences, 133, 104333.
1393 LeCun, Y., Boser, B. E., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W. E., & Jackel, L. D.
1394 (1990). Handwritten digit recognition with a back-propagation network. In Advances in neural
1395 information processing systems (pp. 396-404).
1396 LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document
1397 recognition, Proceedings of the IEEE, 86(11), 2278–2324.
1398 LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. nature, 521(7553), 436-444.
Lee, C.-S., Sohn, E., Park, J.-D., & Jang, J.-D. (2019). Estimation of soil moisture using deep learning based on satellite data: A case study of South Korea. GIScience & Remote Sensing, 56, 43–67. https://doi.org/10.1080/15481603.2018.1489943
1402 Li, M., Zhang, T., Chen, Y., & Smola, A. J. (2014). Efficient mini-batch training for stochastic
1403 optimization. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge
1404 discovery and data mining (pp. 661-670).
1405 Liang, F., Mao, K., Liao, M., Mukherjee, S., & West, M. (2007). Nonparametric Bayesian kernel
1406 models. Department of Statistical Science, Duke University, Discussion Paper, 07-10.
1407 Liang, F., Paulo, R., Molina, G., Clyde, M. A., & Berger, J. O. (2008). Mixtures of g priors for Bayesian
1408 variable selection. Journal of the American Statistical Association, 103(481), 410-423.
1409 Liu, H., Ong, Y. S., Shen, X., & Cai, J. (2020). When Gaussian process meets big data: A review of
1410 scalable GPs. IEEE transactions on neural networks and learning systems, 31(11), 4405-4423.
1411 Liu, Y., Racah, E., Correa, J., Khosrowshahi, A., Lavers, D., Kunkel, K., ... & Collins, W. (2016).
1412 Application of deep convolutional neural networks for detecting extreme weather in climate
1413 datasets. arXiv preprint arXiv:1605.01156.
1414 Liu, F., Xu, F., & Yang, S. (2017). A flood forecasting model based on deep learning algorithm via
1415 integrating stacked autoencoders with BP neural network. In 2017 IEEE third International
1416 conference on multimedia big data (BigMM) (pp. 58-61). IEEE.
1417 Liu, S., Zhong, Z., Takbiri-Borujeni, A., Kazemi, M., Fu, Q., & Yang, Y. (2019). A case study on
1418 homogeneous and heterogeneous reservoir porous media reconstruction by using generative
1419 adversarial networks. Energy Procedia, 158, 6164-6169.
1420 Lv, N., Liang, X., Chen, C., Zhou, Y., Li, J., Wei, H., & Wang, H. (2020). A Long Short-Term Memory
1421 Cyclic model With Mutual Information For Hydrology Forecasting: A Case Study in the Xixian
1422 Basin. Advances in Water Resources, 103622.
1423 Ma, Y., Montzka, C., Bayat, B., & Kollet, S. (2020). Using Long Short-Term Memory networks to
1424 connect water table depth anomalies to precipitation anomalies over Europe. Hydrology and Earth
1425 System Sciences Discussions, 1-30.
1426 MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In
1427 Proceedings of the fifth Berkeley symposium on mathematical statistics and probability (Vol. 1,
1428 No. 14, pp. 281-297).
1429 Mater, A. C., & Coote, M. L. (2019). Deep learning in chemistry. Journal of chemical information and
1430 modeling, 59(6), 2545-2559.
1431 Meempatta, L., Webb, A. J., Horne, A. C., Keogh, L. A., Loch, A., & Stewardson, M. J. (2019).
1432 Reviewing the decision‐making behavior of irrigators. Wiley Interdisciplinary Reviews: Water, 6(5),
1433 e1366.
1434 Meinshausen, N. (2006). Quantile regression forests. Journal of Machine Learning Research, 7(Jun),
1435 983-999.
1436 Mishkin, D., & Matas, J. (2015). All you need is a good init. arXiv preprint arXiv:1511.06422.
1437 Mitchell, T. M. (1997). Machine Learning. Burr Ridge, IL: McGraw Hill.
1438 Mo, S., Zabaras, N., Shi, X., & Wu, J. (2019a). Deep autoregressive neural networks for high-
1439 dimensional inverse problems in groundwater contaminant source identification. Water Resources
1440 Research, 55(5), 3856-3881. https://doi.org/10.1029/2018WR024638.
1441 Mo, S., Zhu, Y., Zabaras, N. J., Shi, X., & Wu, J. (2019b). Deep convolutional encoder-decoder
1442 networks for uncertainty quantification of dynamic multiphase flow in heterogeneous media. Water
1443 Resources Research, 55(1), 703–728. https://doi.org/10.1029/2018WR023528.
1444 Moghaddam, D. D., Rahmati, O., Panahi, M., Tiefenbacher, J., Darabi, H., Haghizadeh, A., ... & Bui,
1445 D. T. (2020). The effect of sample size on different machine learning models for groundwater
1446 potential mapping in mountain bedrock aquifers. Catena, 187, 104421.
1447 Moghaddam, M. A., Ferre, P. A., Chen, X., Chen, K., Song, X., & Hammond, G. E. (2020). Applying
1448 Simple Machine Learning Tools to Infer Streambed Flux from Subsurface Pressure and
1449 Temperature Observations.
1450 Murtagh, F., & Legendre, P. (2014). Ward’s hierarchical agglomerative clustering method: Which
1451 algorithms implement Ward’s criterion? Journal of Classification, 31(3), 274-295.
1452 Naghibi, S. A., Ahmadi, K., & Daneshi, A. (2017). Application of support vector machine, random
1453 forest, and genetic algorithm optimized random forest models in groundwater potential mapping.
1454 Water Resources Management, 31(9), 2761-2775.
1455 Nearing, G., Sampson, A. K., Kratzert, F., & Frame, J. (2020). Post-processing a Conceptual Rainfall-
1456 runoff Model with an LSTM. https://doi.org/10.31223/osf.io/53te4.
1457 Nolan, B. T., Fienen, M. N., & Lorenz, D. L. (2015). A statistical learning framework for groundwater
1458 nitrate models of the Central Valley, California, USA. Journal of Hydrology, 531, 902-911.
1459 Pan, B., Hsu, K., AghaKouchak, A., & Sorooshian, S. (2019). Improving precipitation estimation using
1460 convolutional neural network. Water Resources Research, 55, 2301–2321.
1461 https://doi.org/10.1029/2018WR024090
1462 Pande, S., & Sivapalan, M. (2017). Progress in socio‐hydrology: A meta‐analysis of challenges and
1463 opportunities. Wiley Interdisciplinary Reviews: Water, 4(4), e1193.
1464 Phan, N., Dou, D., Wang, H., Kil, D., & Piniewski, B. (2017). Ontology-based deep learning for human
1465 behavior prediction with explanations in health social networks. Information sciences, 384, 298-
1466 313.
1467 Pianosi, F., & Raso, L. (2012). Dynamic modeling of predictive uncertainty by regression on absolute
1468 errors. Water Resources Research, 48(3). W03516, doi:10.1029/2011WR010603.
1469 Prechelt, L. (1998). Automatic early stopping using cross validation: quantifying the criteria. Neural
1470 Networks, 11(4), 761-767.
1471 Racah, E., Beckham, C., Maharaj, T., Ebrahimi Kahou, S., & Pal, C. (2016). ExtremeWeather: A
1472 large-scale climate dataset for semi-supervised detection, localization, and understanding of
1473 extreme weather events. arXiv preprint arXiv:1612.02095.
1474 Racah, E., Beckham, C., Maharaj, T., Kahou, S. E., Prabhat, M., & Pal, C. (2017). ExtremeWeather: A
1475 large-scale climate dataset for semi-supervised detection, localization, and understanding of
1476 extreme weather events. In Advances in Neural Information Processing Systems (pp. 3402-3413).
1477 Radovic, A., Williams, M., Rousseau, D., Kagan, M., Bonacorsi, D., Himmel, A., ... & Wongjirad, T.
1478 (2018). Machine learning at the energy and intensity frontiers of particle
1479 physics. Nature, 560(7716), 41-48.
1480 Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. Cambridge,
1481 MA: MIT Press.
1482 Rasouli, K., Hsieh, W. W., & Cannon, A. J. (2012). Daily streamflow forecasting by machine learning
1483 methods with weather and climate inputs. Journal of Hydrology, 414, 284-293.
1484 Reichstein, M., Camps-Valls, G., Stevens, B., Jung, M., Denzler, J., & Carvalhais, N. (2019). Deep
1485 learning and process understanding for data-driven Earth system science. Nature, 566(7743),
1486 195-204.
1487 Ren, W. W., Yang, T., Huang, C. S., Xu, C. Y., & Shao, Q. X. (2018). Improving monthly streamflow
1488 prediction in alpine regions: integrating HBV model with Bayesian neural network. Stochastic
1489 Environmental Research and Risk Assessment, 32(12), 3381-3396.
1490 Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization
1491 in the brain. Psychological Review, 65(6), 386-408. doi:10.1037/h0042519.
1492 Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and
1493 use interpretable models instead. Nature Machine Intelligence, 1(5), 206-215.
1494 Rumelhart, D. E., Hinton, G. E. & Williams, R. J. (1986). Learning representations by back-
1495 propagating errors. Nature, 323(6088), 533-536.
1496 Sahoo, S., Russo, T. A., Elliott, J., & Foster, I. (2017). Machine learning algorithms for modeling
1497 groundwater level changes in agricultural regions of the US. Water Resources Research, 53(5),
1498 3878-3895.
1499 Samek, W., & Müller, K. R. (2019). Towards explainable artificial intelligence. In Explainable AI:
1500 interpreting, explaining and visualizing deep learning (pp. 5-22). Springer, Cham.
1501 Sawicz, K. A., Kelleher, C., Wagener, T., Troch, P., Sivapalan, M., & Carrillo, G. (2014).
1502 Characterizing hydrologic change through catchment classification. Hydrology and Earth System
1503 Sciences, 18(1), 273.
1504 Saxe, A. M., Koh, P. W., Chen, Z., Bhand, M., Suresh, B., & Ng, A. Y. (2011). On random weights and
1505 unsupervised feature learning. In ICML (Vol. 2, No. 3, p. 6).
1506 Sengupta, S., Basak, S., Saikia, P., Paul, S., Tsalavoutis, V., Atiah, F., ... & Peters, A. (2020). A
1507 review of deep learning with special emphasis on architectures, applications and recent
1508 trends. Knowledge-Based Systems, 194, 105596.
1509 Settles, B. (2011). From theories to queries: Active learning in practice. In Active Learning and
1510 Experimental Design workshop in conjunction with AISTATS 2010 (pp. 1-18).
1511 Seyoum, W. M., Kwon, D., & Milewski, A. M. (2019). Downscaling GRACE TWSA data into high-
1512 resolution groundwater level anomaly using machine learning-based models in a glacial aquifer
1513 system. Remote Sensing, 11(7), 824.
1514 Shen, C. (2018). A transdisciplinary review of deep learning research and its relevance for water
1515 resources scientists. Water Resources Research, 54(11), 8558-8593.
1516 Shen, C., Laloy, E., Elshorbagy, A., Albert, A., Bales, J., Chang, F. J., ... & Fang, K. (2018). HESS
1517 Opinions: Incubating deep-learning-powered hydrologic science advances as a community.
1518 Hydrology and Earth System Sciences, 22(11).
1519 Shi, X., Chen, Z., Wang, H., Yeung, D. Y., Wong, W. K., & Woo, W. C. (2015). Convolutional LSTM
1520 network: A machine learning approach for precipitation nowcasting. In Advances in neural
1521 information processing systems (pp. 802-810).
1522 Shi, X., Gao, Z., Lausen, L., Wang, H., Yeung, D. Y., Wong, W. K., & Woo, W. C. (2017). Deep
1523 learning for precipitation nowcasting: A benchmark and a new model. In Advances in neural
1524 information processing systems (pp. 5617-5627).
1525 Singh, S. K., & Bárdossy, A. (2012). Calibration of hydrological models on hydrologically unusual
1526 events. Advances in Water Resources, 38, 81-91.
1527 Smith, J., & Eli, R. N. (1995). Neural-network models of rainfall-runoff process. Journal of water
1528 resources planning and management, 121(6), 499-508.
1529 Smith, R. G., & Majumdar, S. (2020). Groundwater storage loss associated with land subsidence in
1530 Western United States mapped using machine learning. Water Resources Research, 56(7),
1531 e2019WR026621. https://doi.org/10.1029/2019WR026621
1532 Sohangir, S., Wang, D., Pomeranets, A., & Khoshgoftaar, T. M. (2018). Big Data: Deep Learning for
1533 financial sentiment analysis. Journal of Big Data, 5(1), 3.
1534 Solomatine, D. P., & Shrestha, D. L. (2009). A novel method to estimate model uncertainty using
1535 machine learning techniques. Water Resources Research, 45(12). W00B11,
1536 doi:10.1029/2008WR006839.
1537 Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., & Winther, O. (2016). Ladder variational
1538 autoencoders. In Advances in neural information processing systems (pp. 3738-3746).
1539 Sorooshian, S., Hsu, K. L., Gao, X., Gupta, H. V., Imam, B., & Braithwaite, D. (2000). Evaluation of
1540 PERSIANN system satellite-based estimates of tropical rainfall. Bulletin of the American
1541 Meteorological Society, 81(9), 2035-2046.
1542 Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a
1543 simple way to prevent neural networks from overfitting. Journal of Machine Learning Research,
1544 15(1), 1929-1958.
1545 Sun, A. Y., Scanlon, B. R., Zhang, Z., Walling, D., Bhanja, S. N., Mukherjee, A., & Zhong, Z. (2019).
1546 Combining physically based modeling and deep learning for fusing GRACE satellite data: Can we
1547 learn from mismatch? Water Resources Research, 55(2), 1179-1195.
1548 https://doi.org/10.1029/2018WR023333
1549 Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013, February). On the importance of initialization
1550 and momentum in deep learning. In International conference on machine learning (pp. 1139-
1551 1147).
1552 Tahmasebi, P., Kamrava, S., Bai, T., & Sahimi, M. (2020). Machine learning in geo- and environmental
1553 sciences: From small to large scale. Advances in Water Resources, 103619.
1554 Tao, Y., Gao, X., Hsu, K., Sorooshian, S., & Ihler, A. (2016). A deep neural network modeling
1555 framework to reduce bias in satellite precipitation products. Journal of Hydrometeorology, 17(3),
1556 931-945.
1557 Tartakovsky, A. M., Marrero, C. O., Perdikaris, P., Tartakovsky, G. D., & Barajas-Solano, D. (2020).
1558 Physics-informed deep neural networks for learning parameters and constitutive relationships in
1559 subsurface flow problems. Water Resources Research, 56(5), e2019WR026731.
1560 https://doi.org/10.1029/2019WR026731
1561 Tennant, C., Larsen, L., Bellugi, D., Moges, E., Zhang, L., & Ma, H. (2020). The utility of information
1562 flow in formulating discharge forecast models: a case study from an arid snow‐dominated
1563 catchment. Water Resources Research, 56(8), e2019WR024908.
1564 Tennant, H., Neilson, B. T., Miller, M. P., & Xu, T. (2021). Ungaged inflow and loss patterns in urban
1565 and agricultural sub-reaches of the Logan River Observatory. Hydrological Processes, doi:
1566 10.1002/hyp.14097.
1567 Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal
1568 Statistical Society, Series B (Methodological), 58(1), 267-288.
1569 Torres, A. F., Walker, W. R., & McKee, M. (2011). Forecasting daily potential evapotranspiration using
1570 machine learning and limited climatic data. Agricultural Water Management, 98(4), 553-562.
1571 Toth, E. (2013). Catchment classification based on characterisation of streamflow and precipitation
1572 time series. Hydrology and Earth System Sciences, 17(3).
1573 Turing, A. (1950). Computing Machinery and Intelligence. Mind, 59(236), 433–460.
1574 doi:10.1093/mind/LIX.236.433
1575 Tyralis, H., Papacharalampous, G., Burnetas, A., & Langousis, A. (2019). Hydrological post-
1576 processing using stacked generalization of quantile regression algorithms: Large-scale application
1577 over CONUS. Journal of Hydrology, 577, 123957.
1578 Van den Oord, A., Dieleman, S., & Schrauwen, B. (2013). Deep content-based music
1579 recommendation. In Advances in neural information processing systems (pp. 2643-2651).
1580 Vandal, T., Kodra, E., & Ganguly, A. R. (2017). Intercomparison of machine learning methods for
1581 statistical downscaling: The case of daily and extreme precipitation. Theoretical and Applied
1582 Climatology, 1-14.
1583 Vapnik, V.N. (1995). The Nature of Statistical Learning Theory. New York: Springer.
1584 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I.
1585 (2017). Attention is all you need. In Advances in neural information processing systems (pp.
1586 5998-6008).
1587 Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P. A., & Bottou, L. (2010). Stacked
1588 denoising autoencoders: Learning useful representations in a deep network with a local denoising
1589 criterion. Journal of Machine Learning Research, 11(12).
1590 Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and tell: A neural image caption
1591 generator. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp.
1592 3156-3164).
1593 Wang, C., Duan, Q., Gong, W., Ye, A., Di, Z., & Miao, C. (2014). An evaluation of adaptive surrogate
1594 modeling based optimization with two benchmark problems. Environmental Modelling & Software,
1595 60, 167-179.
1596 Wang, N., Zhang, D., Chang, H., & Li, H. (2020). Deep learning of subsurface flow via theory-guided
1597 neural network. Journal of Hydrology, 584, 124700.
1598 Wu, B., Zheng, Y., Wu, X., Tian, Y., Han, F., Liu, J., & Zheng, C. (2015). Optimizing water resources
1599 management in large river basins with integrated surface water‐groundwater modeling: A
1600 surrogate‐based approach. Water Resources Research, 51(4), 2153-2173.
1601 Wu, H., Fang, W. Z., Kang, Q., Tao, W. Q., & Qiao, R. (2019). Predicting effective diffusivity of porous
1602 media from images by deep learning. Scientific reports, 9(1), 1-12.
1603 https://doi.org/10.1038/s41598-019-56309-x.
1604 Wu, J., Yin, X., & Xiao, H. (2018). Seeing permeability from images: fast prediction with convolutional
1605 neural networks. Science Bulletin, 63(18), 1215-1222. https://doi.org/10.1016/j.scib.2018.08.006
1606 Wunsch, A., Liesch, T., & Broda, S. (2018). Forecasting groundwater levels using nonlinear
1607 autoregressive networks with exogenous input (NARX). Journal of Hydrology, 567, 743-758.
1608 Xiang, Z., Yan, J., & Demir, I. (2020). A rainfall‐runoff model with LSTM‐based sequence‐to‐sequence
1609 learning. Water resources research, 56(1), e2019WR025326.
1610 Xu, T., Valocchi, A. J., Choi, J., & Amir, E. (2014). Use of machine learning methods to reduce
1611 predictive error of groundwater models. Groundwater, 52(3), 448-460.
1612 Xu, T., Valocchi, A. J., Ye, M., & Liang, F. (2017). Quantifying model structural error: Efficient
1613 Bayesian calibration of a regional groundwater flow model using surrogates and a data‐driven
1614 error model. Water Resources Research, 53(5), 4084-4105. doi:10.1002/2016WR019831.
1615 Xu, T., Deines, J. M., Kendall, A. D., Basso, B., & Hyndman, D. W. (2019). Addressing challenges for
1616 mapping irrigated fields in subhumid temperate regions by integrating remote sensing and
1617 hydroclimatic data. Remote Sensing, 11(3), 370.
1618 Xu, T., Guo, Z., Liu, S., He, X., Meng, Y., Xu, Z., ... & Song, L. (2018). Evaluating different machine
1619 learning methods for upscaling evapotranspiration from flux towers to the regional scale. Journal
1620 of Geophysical Research: Atmospheres, 123(16), 8674-8690.
1621 Yang, J., Jakeman, A., Fang, G., & Chen, X. (2018). Uncertainty analysis of a semi-distributed
1622 hydrologic model based on a Gaussian Process emulator. Environmental Modelling & Software,
1623 101, 289-300.
1624 Yapo, P. O., Gupta, H. V., & Sorooshian, S. (1996). Automatic calibration of conceptual rainfall-runoff
1625 models: sensitivity to calibration data. Journal of Hydrology, 181(1-4), 23-48.
1626 Yaseen, Z. M., El-Shafie, A., Jaafar, O., Afan, H. A., & Sayl, K. N. (2015). Artificial intelligence based
1627 models for stream-flow forecasting: 2000–2015. Journal of Hydrology, 530, 829-844.
1628 Yoon, H., Jun, S.-C., Hyun, Y., Bae, G.-O., & Lee, K.-K. (2011). A comparative study of artificial
1629 neural networks and support vector machines for predicting groundwater levels in a coastal
1630 aquifer. Journal of Hydrology, 396(1–2), 128–138. https://doi.org/10.1016/j.jhydrol.2010.11.002
1631 Yuan, Q., Shen, H., Li, T., Li, Z., Li, S., Jiang, Y., ... & Zhang, L. (2020). Deep learning in
1632 environmental remote sensing: Achievements and challenges. Remote Sensing of Environment,
1633 241, 111716.
1634 Zeng, R., Cai, X., Ringler, C., & Zhu, T. (2017). Hydropower versus irrigation—an analysis of global
1635 patterns. Environmental Research Letters, 12(3), 034006.
1636 Zhang, D., Zhang, W., Huang, W., Hong, Z., & Meng, L. (2017). Upscaling of surface soil moisture
1637 using a deep learning model with VIIRS RDR. ISPRS International Journal of Geo-Information,
1638 6(5), 130.
1639 Zhang, J., Zhu, Y., Zhang, X., Ye, M., & Yang, J. (2018). Developing a Long Short-Term Memory
1640 (LSTM) based model for predicting water table depth in agricultural areas. Journal of hydrology,
1641 561, 918-929. https://doi.org/10.1016/j.jhydrol.2018.04.065.
1642 Zhang, J., Zheng, Q., Chen, D., Wu, L., & Zeng, L. (2020). Surrogate‐Based Bayesian Inverse
1643 Modeling of the Hydrological System: An Adaptive Approach Considering Surrogate
1644 Approximation Error. Water Resources Research, 56, e2019WR025721.
1645 https://doi.org/10.1029/2019WR025721
1646 Zhao, W. L., Gentine, P., Reichstein, M., Zhang, Y., Zhou, S., Wen, Y., ... & Qiu, G. Y. (2019).
1647 Physics‐constrained machine learning of evapotranspiration. Geophysical Research Letters,
1648 46(24), 14496-14507.
1649 Zhu, X., & Goldberg, A. B. (2009). Introduction to semi-supervised learning. Synthesis Lectures on
1650 Artificial Intelligence and Machine Learning, 3(1), 1-130.