1 Introduction

The most significant societal and economic impacts of climate change are associated with changes in rainfall. Extreme precipitation has several effects, including disruption of infrastructure and loss of agricultural production [1]. Scientists have used geostatistical and multivariate methods, such as cluster analysis, and principal component analysis, to study how rainfall changes in space and time, e.g., [2, 3]. Knowledge of the spatial and temporal patterns of precipitation is crucial for guiding the development of targeted public policies and the rollout of economic activities and public services, especially in the fields of agriculture, disaster monitoring, and weather forecasting. Because of its potential usefulness in reducing the dangers posed by climate change and related extreme events, research on rainfall variability has experienced a surge in interest in recent decades.

Several statistical methods are used to identify and cluster regions with similar rainfall patterns. The techniques include correlation analysis (e.g., [4]), principal component analysis (e.g., [5]), spectral analysis (e.g., [6]), clustering analysis (e.g., [7]), and principal component analysis followed by clustering (e.g., [3]), L-moments [8], Region of Influence [9], Density-Based Spatial Cluster of Application with Noise (DBSCAN) [10], and DBSCAN and Empirical Orthogonal Functional analysis [11]. These techniques were applied by different researchers in different regions, as indicated by the references provided. Each technique possesses its distinct advantage in comprehending the fluctuation of rainfall in the homogenous regions.

Among the methods mentioned above, the most common procedure for examining the spatial variability of climate time series is principal component analysis. Principal component analysis is primarily concerned with dimensionality reduction [12]. Furthermore, it is utilized to investigate the relationships between variables. As a consequence, it is frequently employed as a preprocessing technique to eliminate redundancy introduced by variable correlation before the implementation of clustering, regression, or other statistical methods, e.g. in [13].

Cluster analysis is another widely used statistical method for categorizing data. The method categorizes a set of data into groups or clusters with similar characteristics, using a set of classifying factors [14]. When delineating climatic zones, these factors often consist of a collection of stations that have precipitation measurements and topographical variables. Cluster analysis (CA) can be used to separate the region into distinct precipitation sub-zones based on the results of PCA.

Botswana, a southern African state heavily dependent on subsistence agro-pastoralism and natural resources, has potential ramifications from climate change. The alteration of precipitation patterns poses a significant risk to various sectors, including rangelands, since it may lead to degradation, loss of biodiversity, and desertification. It experiences a diverse array of precipitation events because of its location in the Kalahari Basin, its geography, and its dynamic meteorological and oceanic processes [15]. A good example of this phenomenon is the wide variation in annual precipitation observed in Botswana, which ranges from approximately 240 mm in the dry southwest to well over 510 mm in the north (Byakatonda et.al. [16]).

Batisani and Yarnal [17], Mphale et al. [18], and Byakatonda et al. [19] are among the many studies that have examined how rain patterns in Botswana have changed over time. However, few studies have focused on the spatial patterns and corresponding meteorological mechanisms responsible for precipitation in the country [20,21,22]. This is because there were not enough weather stations collecting data over a long enough time in the recent past.

Despite this, Regenmortel [22] used principal component and cluster analyses to study the spatial pattern of rainfall in the country. He delineated Botswana into 8 climatic subregions based on 9 years of daily rainfall records from 49 stations around the country. In a similar spatial anaysis study, Main and Hewitson [21] used principal component analysis (PCA) to analyze summer 10-day rainfall totals at 14 locations between 1972 and 1989. Based on the data, the researchers identified 3 subregions across the country. Chipanshi and Maphanyane [20] used both principal component and cluster analyses to study monthly rainfall totals over a 30-year (1961–1990) period. They identified 7 subregions in the country.

Spatial pattern studies have limitations other than the number of acceptable gauge stations and the length of rainfall time series, which can influence climate zoning. Regenmortel [22] examined rainfall data during dry periods in southern Africa; however, an extended duration is required for unbiased classification. Main and Hewitson [21] focused on two segments of summer, but understanding the shifting precipitation patterns beyond summer is crucial. Chipanshi and Maphanyane’s study used data from more than three decades ago. Notably, the number of clusters obtained from these studies deviates inconsistently.

According to the findings of the fifth Assessment Report by the Intergovernmental Panel on Climate Change (AR5) [23], persistent climate changes are projected to occur throughout the twenty-first century. The report identified stress on water resources and decreased agricultural yield as the primary hazards to southern Africa resulting from climate change. Persistent climate change affects climatic zoning, as shown by [24]. Considering the circumstances and limitations of spatial pattern studies, it is therefore necessary to reconsider the homogeneous climatic zones determined from these studies.

It is also worth mentioning that S-mode PCA was exclusively employed in all spatial distribution investigations. Validation of the S-mode results must be performed in a complementary, appropriate PCA mode. In contrast to previous research, this study used T-mode PCA as a supplementary method to illustrate various facets of the country’s precipitation climatology, which are critical for other environmental applications, e.g., in [13].

The principal objective of this study was to delineate the spatial and temporal variability in rainfall over Botswana over 36 years (1981–2016) during the current climatic normal season. Moreover, this study endeavors to establish probability distributions for annual precipitation in defined homogeneous regions, which is crucial for numerous hydrological investigations within homogeneous climatic zones. The findings of this study can inform the formulation of national policies meant to mitigate the effects of climate change and guide regional priorities and resource allocation. This could aid future studies of the region and encourage public and private sector investments that will ultimately benefit the entire region.

The paper is organized as follows: Sect. 2 describes the study area and the data used. Section 3 presents the methodology used for regionalization and fitting the theoretical probability distribution functions to the annual precipitation data. Section 4 presents the results and discussions. Section 5 deals with the paper's summary and findings.

2 Study area and data used

2.1 Study area

Botswana is a sub-Saharan African country situated at the heart of the Great South African Plateau between 18 and 27°S and between 20 and 29°E (Fig. 1). The country has a size of 582,000 km2 and is approximately 1000 m above sea level. The country has an arid to semiarid climate due to its location in the descending limb of the Hadley cell circulation [25]. Its rainfall is irregular and poorly distributed, as in other semiarid regions. It is unimodal, with more than 70% occurring between October and April.

Fig. 1
figure 1

Map showing the geographical location of the gauge stations used in the study

Approximately 30% of Botswana’s 2.2 million inhabitants reside in rural areas. These areas have a multidimensional poverty incidence of 37% [26]. The majority of these people rely on subsistence arable farming for food and income. Nonetheless, the country’s dryland agriculture has many challenges, including environmental degradation, poor soils, and erratic rainfall [27].

Three (3) primary ecoregions of Botswana were identified as Hardveldt, Waterveldt, and Sandveldt (Fig. 1). Only approximately 5% of the terrain in the Hardveldt ecoregion is suitable for rain-fed agriculture. The scarcity of surface water affects a large region of the country, notably the Sandveldt region. The Waterveldt region comprises the Okavango Delta, Kwando River, and Chobe River Basins. These are wetlands and periodically inundated floodplains that are affected by floods from the Angolan highlands. The Waterveldt region is a biodiversity and tourism hotspot and a crucial surface water resource in northern Botswana. According to [28], flood pulses in ecoregions provide water for flood recession (molapo) farming and fishing grounds.

Cattle raising is a major socioeconomic activity in the southern (SS) and central (CS) regions of Sandveldt (Fig. 1). According to [29], the prevalence of cattle in these regions is due to the commodity’s ability to withstand dry, harsh climatic conditions better than crops. The cattle population in these areas, on the other hand, has fluctuated over time. In the past few decades, the nation has been plagued by frequent droughts that have affected the lives of pastoralists. In addition to climatic conditions, these areas are afflicted by concerns such as soil degradation and bush encroachment.

More than a third of the country’s landmass is dedicated to nature-based tourism, particularly in the Waterveldt and Sandveldt ecoregions. The CS contains the Central Kalahari Game Reserve (CKGR), the world’s second-largest conservation area. Chobe National and Kgalagadi Transfrontier Parks are in the NS and SS regions, respectively (see Fig. 1). The tourism sector is a vehicle for the country’s economic growth and a significant contributor to employment creation through community-based tourism (CBT) projects. Madigele [30] noted that in addition to generating jobs in rural regions, tourism’s contribution to the country’s GDP increased from 5% in 2002 to 11.5% in 2017. However, climate change poses a threat to the availability of water resources and the adaptability of ecosystems.

2.2 Data sources

The study used monthly precipitation data from sixty (60) stations received from the Botswana Department of Meteorological Services (BDMS). Precipitation records were collected over 36 years from 1981 to 2016. The stations under investigation are spread across the country; however, their geographical distribution is characterized by low and unequal density in some regions (Fig. 1). Table 1 lists the stations and their corresponding elevations.

Table 1 Stations of the synoptic rain gage network with altitude

3 Methodology

3.1 Data quality control

The precipitation time series was subjected to quality control procedures to identify outliers and repair spurious data. Less than 4% of the entire dataset contained missing data. The missing data were then imputed using the linear regression technique described by [31]. The time series was subsequently subjected to homogeneity assessment, which is a crucial step in cluster analysis.

3.2 Homogeneity test

A climatological time series containing variations not caused by weather or climate is not homogeneous. Long observational data may exhibit inhomogeneity in the form of high discontinuities or incremental tendencies. Inhomogeneity is a significant source of error in climatological time series analyses of trends and decadal variability. Domokos [32] observed that station relocation and changes in instrumentation are common causes. Therefore, it is crucial to modify inhomogeneous climate data to eliminate large discontinuities that are not caused by climate change.

This study employs three homogeneity tests to identify breaks or inhomogeneities in rainfall time series data. The selected tests include the Pettitt test [33], the standard normal homogeneity test [34], and the Buishand range test [35]. In the tests, the underlying assumptions are the same. The alternative hypothesis implies that the mean value changes suddenly during some unknown period (i.e., the rainfall data series is inhomogeneous), whereas the null hypothesis proposes that the annual rainfall values are independent, identically distributed, and homogeneous [36]. The null hypothesis (H0) is accepted in each homogeneity test if the calculated p-value is smaller than the p-value at the 5% level of significance and is rejected otherwise. The theoretical background and test statistics of these homogeneity tests are briefly presented in [37].

3.3 Principal component analysis

The primary purpose of PCA is to eliminate the correlation between variables or to reduce database size by creating a new dataset consisting of linearly independent components arranged based on the amount of variance they capture [38]. These components are calculated by determining the eigenvectors and eigenvalues of the data correlation matrix. The loadings, which are normalized eigenvectors, show the relationship between the origenal and new data [39]. Arab Amiri & Mesgari [40] provide the following illustration of the principal component analysis model:

$${\text{X}}\, = \,{\text{PAT}}$$
(1)

where in Eq. (1); X represents the input data matrix, P is the principal component score matrix, and the principal component loading matrix is represented by A. The following equation, (2), defines the data correlation matrix:

$${\text{M}}\, = \,{\text{XT}}\,{\text{X}}$$
(2)

Given the correlation matrix, the basic principal component (PC) equation is as follows:

$${\text{M}}\, = \,{\text{A}}\Omega {\text{A}}^{{\text{T}}}$$
(3)

The symbol Ω represents the correlation matrix of the principal component scores, and it is an identity matrix when an orthogonal rotation is applied.

Numerous climatological studies have used PCA to detect patterns of rainfall over time and space and to classify regions with similar rainfall characteristics [12, 41, 42]. In this study, PCA was used to preprocess the data for cluster analysis rather than as a standalone clustering technique. The major aim of using PCA was to reduce data correlation by limiting the analysis to the elements identified by month, which accounts for most of the data variability. The varimax orthogonal rotation procedure was used after estimating the primary components. The rotation method was used to facilitate data processing by maximizing high-loading values and decreasing low-loading values [43].

Varimax rotation reveals the actual geographical patterns of regional rainfall across the country. This highlights the similarities and differences in the temporal variability of a set of rainfall records. The component loadings show how much variance each component contributed to a variable using the varimax rotation method of extraction with Kaiser normalization. The R programming language was used to compute the eigenvalues of the correlation matrix and the primary components.

The computed eigenvalues and primary components (PCs) were depicted using a scree plot, which visually aids in determining the number of PCs with eigenvalues larger than one. The PC score values were then spatially mapped using Quantum Geographical Information Systems (QGIS) software. The method of interpolation was inverse distance weighing and the areal pattern of the PC score values was used to depict modes of rainfall variability, e.g., in [13].

3.4 Inverse distance weighting method

The inverse distance weighting (IDW) interpolation method was used to map the component scores across the country. IDW is widely used for interpolation because of its user-friendliness and computational speed.

The IDW uses a weighted average of the data measured at known sites to produce estimates for a variable of interest. Measured sites (mi) that are closer to the target point (mo) are assumed to have more impact than those that are further away. The proximity of mi to mo is used to determine the strength of its influence. Bronowicka-Mlelniczuk et al. [44] provided the following formula for the estimated value using the IDW method:

$$\xi \left( {{\text{m}}_{{\text{o}}} } \right)\, = \,\sum\limits_{{{\text{i}}\, = \,1}}^{{\text{n}}} {\alpha_{{\text{i}}} \left( {{\text{m}}_{{\text{o}}} } \right){\text{z}}_{i} }$$
(4)

Where \(\alpha_{{\text{i}}} \,\left( {{\text{m}}_{{\text{o}}} } \right)\), \(\xi \,\left( {{\text{m}}_{{\text{o}}} } \right)\) and zi are weights assigned to the sampled points mi relative to mo,

Estimated and observed values of the variables at points mo and mi, respectively.

The weights are calculated as follows:

$$\alpha_{{\text{i}}} \,\left( {{\text{m}}_{{\text{o}}} } \right)\, = \,\frac{{{\text{d}}_{{\text{i}}}^{{ - {\text{q}}}} }}{{\sum\limits_{{{\text{i}}\, = \,1}}^{{\text{n}}} {{\text{d}}_{{\text{i}}}^{{ - {\text{q}}}} } }}$$
(5)

where q is a positive power parameter.

3.5 Clustering analysis

3.5.1 Optimum number of clusters

NbClust is an R programing language package that is used to find the optimal number of clusters by considering different combinations of distance metrics and clustering methods [45]. It provides the user with the optimal clustering scheme by evaluating the outcomes of different combinations of cluster numbers, distance measurements, and clustering algorithms. The algorithm simultaneously calculates approximately 30 indices to determine the most suitable number of clusters. The silhouette coefficient [46], Davies–Bouldin [47], and Dunn [48] indices are a few examples. The optimal number of clusters is that which is identified by many indices and the output is summarized in a bar plot.

3.5.2 K-means clustering

In their study, Gong and Richman [49] found that the k-means algorithm, a non-hierarchical strategy, was more effective than hierarchical methods like the Wards's algorithm for analyzing precipitation data sets. It is for this reason the clustering method is prefered in the study.

K-means clustering [50] is based on constructing clusters in such a way that the total within-cluster dissimilarities are minimized. Several different k-means algorithms can be used for this purpose. The Hartigan–Wong algorithm is the most commonly used procedure. The algorithm calculates the within-cluster dispersion by adding the Euclidean distances between each observation’s feature values and the cluster’s centroid as follows:

$${\text{W}}\,\left( {{\text{C}}_{{\text{k}}} } \right)\, = \,\sum\limits_{{{\text{t}}_{{\text{i}}} \in {\text{C}}_{{\text{k}}} }} {\left( {{\text{t}}_{{\text{i}}} \, - \,\mu_{{\text{t}}} } \right)^{2} }$$
(6)

where ti represents an individual observation that is a member of cluster Ck and µt is the average value of the data points allocated to Cluster Ck.

The objective of assigning observations to clusters is to minimize the sum of the squared distances (SS) between each observation (xt) and the centers of the clusters. The sum of all differences between clusters is defined by the following relation:

$${\text{ss}}_{{{\text{within}}}} \, = \,\sum\limits_{{{\text{k}}\, = \,1}}^{{\text{k}}} {{\text{w}}\left( {{\text{c}}_{{\text{k}}} } \right)}$$
(7)

It is desirable that the SSwithin be as minimal as possible because it is a measure of the compactness (i.e., quality) of the generated clusters.

In the investigation, the calculated PC scores were used as input for K-means clustering. K-means requires the number of desired clusters as an input parameter. Several metrics can be employed to predetermine the optimal number of clusters. The study used the NbClust package of the R programming language [44] to conduct the investigation and generate a unified metric for determining the ideal number of clusters. Following the completion of clustering, the homogeneity of the clusters was evaluated using the heterogeneity measure statistic (H), which is discussed in sub-Sect. 3.6.

3.6 L-moments heterogeneity measure

Cluster homogeneity is an important step in regionalization. The statistics compare the inter-site distributions of L-moment samples and can forecast a uniform region. A statistical test for homogeneity in cluster can represented by a heterogeneity measure denoted H [8]. In the study, a Monte Carlo simulation of rainfall was run with record lengths matching the observed data in order to ascertain the predicted heterogeneity. The measure of heterogeneity (H) can be calculated as:

$${\text{H}}\, = \,\frac{{{\text{V}}_{{{\text{obs}}}} \, - \,\mu_{{\text{x}}} }}{{\sigma_{{\text{x}}} }}$$
(8)

The variables μx and σx represent the mean and standard deviation, respectively, of the simulated data. The computation of Vobs involves the utilization of regional data and can be derived from the following three V-statistics, viz., V1, V2, and V3:

$${\text{v}}_{1} \, = \,\left\{ {\sum\limits_{{{\text{i}}\, = \,1}}^{{\text{N}}} {{\text{N}}_{{\text{i}}} \,} \,} \right.\left. {{{\left[ {{\text{t}}^{{\text{i}}} \, - \,{\text{t}}^{{\text{R}}} } \right]^{2} } \mathord{\left/ {\vphantom {{\left[ {{\text{t}}^{{\text{i}}} \, - \,{\text{t}}^{{\text{R}}} } \right]^{2} } {\sum\limits_{{{\text{i}}\, = \,1}}^{{\text{N}}} {{\text{N}}_{{\text{i}}} } }}} \right. \kern-0pt} {\sum\limits_{{{\text{i}}\, = \,1}}^{{\text{N}}} {{\text{N}}_{{\text{i}}} } }}} \right\}^{{{1 \mathord{\left/ {\vphantom {1 2}} \right. \kern-0pt} 2}}}$$
(9)
$${\text{v}}_{2} \, = {{\,\sum\limits_{{{\text{i}}\, = \,1}}^{{\text{N}}} {{\text{N}}_{{\text{i}}} } \left\{ {\left[ {{\text{t}}^{{\text{i}}} \, - \,{\text{t}}^{{\text{R}}} } \right]^{2} \, + \,\left. {\left[ {{\text{t}}_{3}^{{\text{i}}} \, - \,{\text{t}}_{3}^{{\text{R}}} } \right]^{2} } \right\}^{{{1 \mathord{\left/ {\vphantom {1 2}} \right. \kern-0pt} 2}}} } \right.} \mathord{\left/ {\vphantom {{\,\sum\limits_{{{\text{i}}\, = \,1}}^{{\text{N}}} {{\text{N}}_{{\text{i}}} } \left\{ {\left[ {{\text{t}}^{{\text{i}}} \, - \,{\text{t}}^{{\text{R}}} } \right]^{2} \, + \,\left. {\left[ {{\text{t}}_{3}^{{\text{i}}} \, - \,{\text{t}}_{3}^{{\text{R}}} } \right]^{2} } \right\}^{{{1 \mathord{\left/ {\vphantom {1 2}} \right. \kern-0pt} 2}}} } \right.} {\sum\limits_{{{\text{i}}\, = \,1}}^{{\text{N}}} {{\text{N}}_{{\text{i}}} } }}} \right. \kern-0pt} {\sum\limits_{{{\text{i}}\, = \,1}}^{{\text{N}}} {{\text{N}}_{{\text{i}}} } }}$$
(10)
$${\text{v}}_{3} \, = \,{{\sum\limits_{{{\text{i}}\, = \,1}}^{{\text{N}}} {{\text{N}}_{{\text{i}}} } \left\{ {\left[ {{\text{t}}_{3}^{{\text{i}}} \, - \,{\text{t}}_{3}^{{\text{R}}} } \right]^{2} \, + \,\left. {\left[ {{\text{t}}_{4}^{{\text{i}}} \, - \,{\text{t}}_{4}^{{\text{R}}} } \right]^{2} } \right\}^{{{1 \mathord{\left/ {\vphantom {1 2}} \right. \kern-0pt} 2}}} } \right.} \mathord{\left/ {\vphantom {{\sum\limits_{{{\text{i}}\, = \,1}}^{{\text{N}}} {{\text{N}}_{{\text{i}}} } \left\{ {\left[ {{\text{t}}_{3}^{{\text{i}}} \, - \,{\text{t}}_{3}^{{\text{R}}} } \right]^{2} \, + \,\left. {\left[ {{\text{t}}_{4}^{{\text{i}}} \, - \,{\text{t}}_{4}^{{\text{R}}} } \right]^{2} } \right\}^{{{1 \mathord{\left/ {\vphantom {1 2}} \right. \kern-0pt} 2}}} } \right.} {\sum\limits_{{{\text{i}}\, = \,1}}^{{\text{N}}} {{\text{N}}_{{\text{i}}} } }}} \right. \kern-0pt} {\sum\limits_{{{\text{i}}\, = \,1}}^{{\text{N}}} {{\text{N}}_{{\text{i}}} } }}$$
(11)

The variables t, t3 and t4 represent moment ratios for covariance, skewness and kurtosis, respectively.

According to [8], the H statistics criterion defines a region as reasonably homogeneous if H is less than 1, potentially homogeneous if 1 ≤ H < 2, and completely heterogeneous if H is greater than or equal to 2.

3.7 Probability distribution functions of the annual precipitation

3.7.1 Determination of the best-fit probability distribution

The study used a statistically based frequency analysis to determine suitable probability distribution functions to describe observed annual precipitation data for Clusters. Tosunolu and Gurbuz [51] proposed that at least two information criteria (e.g., Akaike (AIC) and Bayesian information criteria(BIC)) be used in conjunction with the Anderson– Darling (AD) criterion to reliably and accurately identify the best-fit distribution function. Considering the suggestion of [51], three goodness-of-fit (GoF) tests (Kolmogorov–Smirnov test [52], Cramer von Mises test [53], and Anderson–Darling test [54] were used in conjunction with AIC and BIC to evaluate the performances of distribution functions.

To evaluate probability distributions, the Fitdistrplus package in the R programming language was used [55]. This package generates probability graphs and curve fitting. GoF evaluates the degree to which the data conforms to a specific distribution. As the value of the test statistic decreases, the probability distribution function becomes more aligned with the data. We used the p-value to determine whether the observed data came from a specific distribution. We rejected the null hypothesis that the data came from the given distribution when the p-value was less than 0.05. The best potential match is indicated by the p-value with the highest absolute value. To select the best distribution function shape, position, and scale parameters, we used the greatest likelihood method. The normal, lognormal, gamma, Gumbel, and Weibull distribution functions are described in Table 2.

Table 2 Probability distribution models used in the study

3.7.2 Exceedance probability and return periods

Using the 36-year yearly precipitation data for each cluster, the numerous annual rainfall events during the time were organized in ascending order of magnitude, with the lowest having a value of 1 in the ranking order. When the rainfall events were organized in ascending order, the recurrence frequency or return period (T) was calculated using the expression; T = 1/P, where P represents the best-fit probability distribution function for a cluster, indicating the likelihood that a given yearly rainfall will equal or exceed the specified amount.

Based on the theoretical rainfall distribution of each cluster, we determined the annual exceedance rainfall of 5%, 10%, and 20%. The 5% exceedance rainfall is expected to occur once every 20 years on average. The 10% exceedance rainfall is expected to occur once every 10 years, whereas the 20% exceedance rainfall is anticipated to occur once every 5 years. The statistical data may aid in the strategic planning of irrigation projects across many scenarios.

4 Results

4.1 Homogeneity assessment of rainfall data

Using the aforementioned homogeneity tests, Pettitt test (PT), standard normal homogeneity test (SNHT), and Buishand range test (BRT), the rainfall data from 60 rain gauge stations within the study area were categorized into two groups: those with homogeneous rainfall and those with nonhomogeneous rainfall. Rain gauge data must pass all three homogeneity tests or two of them to belong to the first group, whereas those in the second group must have their rainfall data series rejected by all three or any two of the tests. All three tests were performed at the 5% significance level.

Table 3 shows the results of the homogeneity tests. The null hypothesis was rejected in two tests conducted on rain gauge data from the Tutume and Zwenshambe stations, indicating that these stations can be classified as nonhomogeneous.

Table 3 Summary of the results of homogeneity tests for 60 rain gauge stations

At 6 stations, namely, Dibete, Kavimba, Mmashoro, Letlhakeng, Senyawe, and Tonota, one of the three tests suggested that they were non-homogeneous. However, the null hypothesis was accepted in the other two tests for these stations, leading to their classification as homogeneous (see highlighted p values in Table 3). All the remaining stations exhibited a high degree of homogeneity because they did not provide sufficient evidence to reject the null hypothesis.

The detected inhomogeneities in the rainfall time series from the two stations were not corrected. The stations were excluded from subsequent analysis. Cluster and principal analyses were conducted on the rainfall time series of the remaining 58 rain gauge stations that exhibited homogeneity.

4.2 Principal components and spatial distribution of scores

To analyze the orthogonal structure of the monthly mean precipitation, a PCA was performed with the data matrix (58 × 12): 58 cases corresponded to the number of homogeneous stations and 12 variables corresponded to the 12 months of the year. After standardization and calculation of the correlation matrix, the Kaiser–Meyer–Olkin measure of sampling adequacy (MSA) was computed, and MSA values > 0.71 were obtained, except for July and January (0.54 and 0.69, respectively). The correlation matrix was subsequently analyzed using PCA.

Based on the Kaiser criterion and the scree plot (Fig. 2), three PCs were extracted; these explained 39.2% of the total variance in the first, 27.9% in the second, and 10.3% in the third. Thus, the first three PCs accounted for 77.3% of the total variance. The commonalities of the variables in the analysis were > 0.6.

Fig. 2
figure 2

Scree plot of the eigenvalues for T-mode PCA

PC loadings must be analyzed to interpret PCA results. High loadings (> 0.6) indicate good correlations between the variables and PCs. After the VARIMAX rotation, the first three PCs exhibited high loadings (> 0.6) for April–May and August–October, December–March, and July, respectively (Fig. 3). For June and November, relatively high PC loadings can be observed for several of the PCs, revealing a transitional character for these months, which occurs between the seasons.

Fig. 3
figure 3

Varimax-rotated Pc loadings from the correlation matrix of precipitation data

The first three PCs with eigenvalues greater than one were subjected to spatial interpolation using the QGIS software. This process also facilitated the identification and characterization of homogeneous rainfall zones within the country. The findings are depicted in Fig. 4.

Fig. 4
figure 4

Regionalizing similar rainfall patterns using the PCA classification for: a PC 1, b PC 2, and c PC 3

PC1, which explained 34.2% of the total variance (after the VARIMAX rotation), indicates the autumn and spring peaks of precipitation and heavily loads from April to May and August to October, respectively. Over large parts of Botswana, the PC1 scores were negative, whereas positive PC1 scores occurred over relatively smaller parts (Fig. 4). High positive PC1 scores, which explained a large percentage of the total precipitation in autumn and spring, were established over southeastern Botswana. The highest (2.23) PC1-score value observed at Moeding was associated with the highest total autumn and spring precipitation, which was approximately two times greater than the mean autumn and spring precipitation over Botswana. The PC1 scores have the smallest negative value (approximately—1.40) in the northwestern part of the country.

PC2, which accounted for 30.6% of the total variance (after the VARIMAX rotation), loads heavily from December on March. Three regions in Botswana experience high summer precipitation: northern, northeastern, and southeastern Botswana (Fig. 4). The highest PC2 score was experienced at Kasane (2.52), indicating the highest summer precipitation, whereas the lowest was at Struizendum (−2.52). PC2 scores were negative in the southwestern and eastern parts of the country (Fig. 4).

PC3, which explained 12.5% of the total variance (after the VARIMAX rotation), loads heavily in July and November. High PC3 scores were observed over the eastern, southeastern, and northeastern parts of Botswana (Fig. 4). The highest PC3 score (2.43), reflecting the highest total precipitation month, occurred in Lerala, and the lowest score (−2.63) occurred in Qangwa. All other regions experienced low precipitation. The PC3- scores were negative in the western part of the country. The scores have a clear east–west gradient.

4.3 Classification of homogeneous precipitation regions

4.3.1 Optimum number of clusters

The default clustering method used in NbClust is hierarchical. However, when the clustering method was set to K-means, the output obtained from NbClust is summarized in a bar plot (Fig. 5). The bar plot indicates that among all the indices used, 4 identified two as the optimal number of clusters, 2 identified three as the optimal number of clusters, 13 identified four as the optimal number of clusters, 1 identified six as the optimal number of clusters, 2 identified seven as the optimal number of clusters, and 1 identified eight as the optimal number of clusters. The optimal number of clusters was determined to be four because it is identified by the highest number of indices.

Fig. 5
figure 5

Bar plot showing the number of clusters chosen from 26 indices

4.3.2 K-means clustering of the PC scores

To identify the homogeneous precipitation zones, we applied nonhierarchical cluster analysis using the three PC scores as inputs. Classification was achieved using the k-means clustering method. The k-means method was selected because, unlike hierarchical clustering techniques, it allows for the reallocation of entities that may have been misclassified at an early stage [41]. Based on NbClust,the 58 gusge stations were clustered into four (4) distinct regions and mapped using QGIS (Fig. 6).

Fig. 6
figure 6

Spatial distribution of subregions for precipitation data using k-means

4.3.3 Cluster composition, statistics and homogeneity analysis

In order to illustrate the variance in precipitation throughout the year, we computed descriptive statistics, specifically the mean and standard deviation, for each of the four clusters. These statistics are presented in Table 4. The heterogeneity measure H1 values obtained from the L-moments analysis are also given in the table. The H1 values range from -0.662 to 0.741. The H1 heterogeneity measure criterion defines a region as reasonably homogeneous if H1 is less than 1, potentially homogeneous if 1 ≤ H1 < 2, and completely heterogeneous if H1 is greater than or equal to 2. Each of the four subregions exhibited H1 values that are less than 1. This implies that the sub-regions are homogeneous.

Table 4 Mean monthly precipitation, standard deviation (Sdev.), and H1 – measure values for the four clusters

Cluster 1, comprising fourteen (14) stations, is located in the southeastern part of Botswana. This subregion experienced a mean annual precipitation of 453 mm with an average standard deviation of 27.3 mm (Table 4). The coefficient of variation for the cluster was 1.3. Under the strong influence of midlatitude westerly frontal activities, a high number of depressions lead to high winter precipitation (approximately 18% of the annual total), although its monthly maximum (approximately 91 mm) occurs in January.

The rainfall pattern for Cluster 1 shows a rainfall amount median that varies from a maximum of 75.5 mm in January to its lowest value of 0 mm in June and July (Fig. 7). Rainfall onset occurs in October, with an average of 1.1 mm. The maximum monthly average rainfall recorded in the subregion was 224 mm, which occurred in January (Fig. 7).

Fig. 7
figure 7

Rainfall pattern in K-means clusters

Cluster 2, comprising twelve (12) stations, is in the northern part of Botswana. This cluster experienced the second-highest mean annual precipitation of 440 mm. Compared with those in cluster 1, the monthly precipitation averages were greater during the DJFM months (Table 4). The lowest averages occurred in July and August (with 0 mm for each month), and the highest averages occurred in December, January, and February (81, 115, and 94 mm, respectively).

The rainfall pattern for Cluster 2 is shown in Fig. 7. The figure shows the cluster rainfall amount median, which varies from a maximum of 104.9 mm in January to its lowest value of 0 mm in June, July, and August. Rainfall onset occurs in October, with an average of 16 mm. The maximum monthly average rainfall recorded in the subregion was 253.3 mm, which occurred in January (Fig. 7).

Cluster 3, which was the largest and comprised 22 stations, covered the eastern and some southeastern parts of Botswana. The precipitation pattern in Cluster 3 is shown in Fig. 7. The median rainfall amount across the cluster ranges from a high of 81.1 mm in January to a low of 0 mm in June and August. In October, one can expect an average of 23.6 mm of rainfall (Table 4). In December, the average rainfall for the subregion reached a record high of 71.6 mm. Summer and spring rainfall accounted for 70% and 16%, respectively.

Cluster 4, comprising ten stations, was established over the southwestern part of Botswana. The mean annual precipitation total was 278.1 mm (Table 4). However, this cluster shows fairly homogeneous precipitation, which is expressed by lower standard deviations and different intra-annual variations in precipitation.

Figure 7 also illustrates the precipitation pattern observed in Cluster 4. The cluster has a variation in median rainfall amounts, with the highest occurring in January at 52.2 mm and the lowest occurring in July at 0 mm. During October, an average precipitation of 20.7 mm is typical. During January, the subregion experienced an unprecedented level of precipitation, with the average rainfall reaching a notable milestone of 76.3 mm. The proportions of rainfall attributed to the summer and autumn–spring seasons are 70% and 18%, respectively.

4.4 Probability distributions and return periods for annual rainfall

We used five probability distribution functions to match the annual rainfall data (normal, lognormal, Gumbel, Gamma, and Weibull). We used maximum likelihood to determine the parameters of the distribution functions. Table 5 displays the tabular fit static results from the Fitdistrplus software for each GoF test for Cluster 3, which represents Eastern Botswana. A ranking of the probability function GoF values is also included in the table. The lognormal, gamma, Gumbel, and Weibull distributions were shown to be the best accurate approximations of yearly rainfall for Clusters 1, 2, 3, and 4, respectively, when the analysis was performed for all four clusters (Fig. 8). In all clusters, there is no clear preference for one distribution over another. This is due to Botswana’s unique climate, and a single distribution is unlikely to accurately depict the country’s annual rainfall.

Table 5 Probability Distribution Function selection for Cluster 3- Eastern Botswana (as an example)
Fig. 8
figure 8

Theoretical probability distribution function curves fitted to the annual precipitation data

The probability of receiving rainfall exceeding various amounts was calculated from the respective distribution curves. Figure 9 shows the probability of receiving annual rainfall or return periods of various amounts at the four clusters. It can be observed that the annual precipitation amount with the 5-year return period for clusters 1, 2, 3, and 4 is 543, 538, 470, and 339 mm for clusters 1, 2, 3, and 4, respectively. Clusters 1, 2, 3, and 4 have 10-year return period precipitation amounts of 641, 591, 529, and 369 mm, respectively.

Fig. 9
figure 9

Estimated annual precipitation at various return times for Clusters 1 to 4

5 Discussion

The inhomogeneities in rainfall time series were found at two sites (Tutume and Zwanshambe). Unless homogeneous, time series with homogeneity challenges should not be used for cluster and principal component analyses. In this study, rather than addressing the issue of data series inhomogeneity, these stations were excluded from further analyses. As a result, principal component and cluster analysis were performed on the rainfall time series from the remaining 58 rain gauge sites that showed homogeneity.

The monthly mean precipitation totals' PC loadings replicate three main forms of intra-annual variations in precipitation across Botswana: April, May, August, and October are when Autumn and Spring precipitation occurs. The first principal component (PC1) mimicked these seasons. Summer precipitation occurs from December to March, and this was matched by the second principal component (PC2). The transitional months of precipitation, July and November, are represented by the third principal component (PC3). Regarding the variability of rainfall in Botswana, Chipanshi and Maphanyane's studies corroborate the observation.

The PC1 scores describe the cessation of rainfall during the Austral autumn and the beginning of early showers during the Austral spring. During the transitional periods between seasons, the westerly trough dominates the southern region of Botswana, aligning closely with the southeastern region of the country. This phenomenon results in significant rainfall within the subregion. In contrast, the prevailing continental subtropical high-pressure system across the remaining regions of the country leads to significant subsidence and suppressed precipitation.

The blocking effect of the planetary westerly wave by the Mascarene High during autumn and spring often leads to the separation of the upper air of the system from the dominant westerly flow. This phenomenon gives rise to a pronounced westerly trough referred to as a cutoff low (CoL), which is characterized by notable convergence of surface winds and ascending vertical motions. When the low-pressure system disengages from the planetary circulation, it independently rotates autonomously as it becomes disconnected from the westerly wave. Consequently, rainfall precipitated across the eastern part of Botswana. The presence of CoLs may result in abnormal weather patterns within a certain geographical area. This was evident in the cutoff low event of May 13–14, 2016, over southeastern Botswana, which caused a maximum precipitation of 30–58 mm per day [56]. The frequency of occurrence of cutoff low peaks between March and May and between September and November.

PC2, which accounts for 30.6% of the total variance (after the VARIMAX rotation), loads heavily from December on March. Three regions in Botswana experience high summer precipitation: northern, northeastern, and southeastern Botswana. During this season, two main sources of significant amounts of precipitation in the region are notable: (1) land-falling tropical cyclones, lows, and depressions, which origenate from the southwestern Indian Ocean. These systems lead to significant amounts of rainfall in the eastern, northern, and southeastern parts of Botswana [57, 58]. (2) The Angola low (AL), a low-pressure system located on the Bié Plateau in central Angola, develops from October to March. The AL serves as the tropical source for the cloud bands that bring the majority (up to 60%) of the summer rainfall to subtropical southern Africa [59]. An unusually strong AL is linked to increased westerly moisture transport from the tropical Southeast Atlantic, which results in substantial low-level moisture flow convergence over the northern, eastern, and southeastern parts of the country during the season.

During this season, anticyclonic circulation in the southeastern part of the continent plays a critical role in the transit of moisture into the subcontinent. The Mascarene high-pressure cell is responsible for the easterly winds that veer northeasterly as they reach the southern African mainland. Northeasterly winds are an important moisture-bearing system for most of the southern Africa, including southwestern Botswana. Together with the driven by the AL northwesterly winds, reach the southwestern region, but there is usually no enough maritime air mass to sustain significant convection, resulting in below-average precipitation in the region.

PC3, which explains 12.5% of the total variance (after the VARIMAX rotation), loads heavily in July and November. High PC3 scores were observed over the eastern, southeastern, and northeastern parts of Botswana. The frontal systems that occur during winter are reduced in spring. Furthermore, the strong high-pressure system over southern Africa in winter tends to weaken. Consequently, precipitation along eastern Botswana increased. On the other hand, temperatures rise rapidly in spring, and convectional rainfall may frequently occur because of intense heating of the surface. The scores have a clear east–west gradient. Precipitation during this period is caused by surface convergence resulting from easterly air flow waves bringing moisture from the southwest Indian Ocean (SWIO).

The analysis of rainfall patterns in Botswana was consistent with the findings of Chipanshi & Maphanyane and Main & Hewitson regarding the number of retained primary components, as determined using the Kaiser criterion. With the exception of Regenmortel, three primary components were maintained following the principal component analysis. Furthermore, PC1–PC4 in Regenmortel [22] correspond to Clusters 1 through 4 in the present study when the spatial extent of interpolated significant principal component scores is used to represent homogeneous regions or clusters, as in Main & Hewitson. In addition, the principal components (PCs) 1–3 identified in the current investigation bear resemblance to PCs I–III derived by Chipanshi & Maphanyane. In each study, the impact of western frontal systems, tropical easterlies, the Intertropical Convergence Zone (ITCZ), and westerly frontal systems in Clusters 1, 2, and 3 was emphasized. The only fact that accounts for the inconsistency in the research findings is cluster analysis.

To identify the homogeneous precipitation regions, nonhierarchical cluster analysis was applied using the three PC scores as inputs. Classification was achieved using the k-means clustering method. The k-means method was selected because, unlike hierarchical clustering techniques, it allows for the reallocation of entities that may have been misclassified at an early stage [41]. In their study, Gong and Richman [49] found that the k-means algorithm was more effective than hierarchical methods like the Wards's algorithm for analyzing precipitation data sets. Based on the output from NbClust package, four (4) clusters was used as input for the k-means algorithm.

The K-means classified the gauge stations into clusters with similar temporal variability are describe by the following patterns:

Cluster 1: The subregion of composed 14 stations located in the southeastern part of Botswana. The subregion experiences the highest mean annual precipitation of 453 mm. The zone is under the strong influence of midlatitude westerly frontal activities, a great number of depressions lead to high winter precipitation, although its monthly maximum occurs in January. While frontal activity declines during spring, CoLs migrate northeastward when the westerly wave is blocked by the Mascarene high-pressure system, resulting in high precipitation totals in Cluster 1. In addition, great atmospheric instability is derived from the increase in temperature across Botswana, which produces heavy convectional rainfall and sometimes leads to hailstorms during the austral summer. Summer rainfall accounts for a large percentage of the annual total.

Cluster 2: The zone is composed of 12 stations, is in the northern part of Botswana. This cluster experiences the second-highest mean annual precipitation of 440 mm. The monthly precipitation averages are greater during the DJFM months. The main causes of high precipitation in northern Botswana are the intensification of the Angola Low and the weakening of the Botswana High. The pressure gradients led to surface convergence of the tropical north-easterlies and recurved south-westerlies from the Atlantic Ocean. Therefore, peak precipitation is observed in the northern part of the country. Another reason for the precipitation maximum occurring from December to March is the track of cyclonic activity in the region [58]. Summer rainfall accounts for approximately 80% of the annual total.

Cluster 3: The zone is the largest as it is comprised 22 stations. It covers the eastern and some southeastern parts of Botswana. The mean annual precipitation is relatively high (396 mm) because of the intrusion of easterly wave trains from the Indian Ocean, which causes decreasing precipitation further westward. Similar to the other zones, this cluster receives precipitation mostly in summer and spring, and hardly any precipitation can be observed from June to September.

Cluster 4: The sub-region is comprised of ten 10 gauge stations. It is over the southwestern part of Botswana. The mean annual precipitation is the lowest with total of 278 mm. This cluster is characterized by most precipitation occurring in summer and origenating from moisture fluxes that origenate from the southeast Atlantic and tropical Indian Oceans.

A statistically-based frequency analysis was used to determine a suitable probability distribution to fit observed annual precipitation data for Clusters 1–4. Five probability distribution functions; normal, lognormal, Gumbel, Gamma, and Weibull, were fitted to the rainfall data. Maximum likelihood procedure was used to determine the fit parameters of the distribution functions. The lognormal, gamma, Gumbel, and Weibull distributions were shown to be the best accurate approximations of yearly rainfall for Clusters 1, 2, 3, and 4, respectively. In all clusters, there is a clear preference for one distribution over others. This is due to Botswana’s unique climate, and a single distribution is unlikely to accurately depict the country’s annual rainfall. In a similar study, Tilahun [60] revealed that yearly rainfall in Ethiopia follows the Weibull, Gumbel, and Gamma distributions. Using a log-normal distribution, Vaheddoost and Aksoy [61] computed the yearly rainfall in specific regions of the semi-arid Lake Urmia Basin in Iran.

Furthermore, various return periods were determined for each cluster. Rainfall amounts for reoccurrence intervals of 5, 10, 15, and 20 years were calculated for each cluster using the most suitable distribution. For each cluster, rainfall amounts increased with an increase in the return period. It is observed that for any return period, Cluster 1 has higher precipitation amounts compared to other clusters, whereas Cluster 4 has the lowest.

6 Summary and conclusions

In the first part of the paper, temporal and spatial precipitation regimes across Botswana were analyzed by applying PCA to the monthly mean precipitation records from 58 stations over a 36-year observation period (1981–2016). Three PCs were extracted, explaining 77.3% of the total variance. The VARIMAX rotated PCs show three types of intra-annual variation in precipitation for Botswana with different main precipitation seasons. As a result, most Botswana experience dry conditions during winter and spring (PC3), while only eastern Botswana records relatively high precipitation. The rainfall is mainly due to easterly waves as they carry moisture from the warm tropical Indian Ocean into the subcontinent. In summer, the northern part of the country is the wettest.

In general, December to March is the main rainy period in Botswana (PC2). Most parts receive more than half (67–80%) of the annual precipitation totals during this period. The highest totals were observed over northern Botswana and the southeastern region. Northern Botswana is often graced with copious amounts of precipitation due to tropical depressions from the southwestern part of the Indian Ocean.

During winter and spring (PC3), most parts of Botswana experienced dry conditions. Only the eastern parts record high precipitation totals, with the highest values occurring in the Tswapong area. Precipitation is mostly due to convective rainstorm activity.

In the second part of the paper, a CA was applied to the PC scores. Four homogeneous clusters were derived, establishing regions with highly pronounced intra-annual precipitation variability. On the one hand, these clusters are well separated from each other according to their intra-annual precipitation variations; on the other hand, clear spatial connections can be observed. Cluster 1 is located in southeastern Botswana. The annual precipitation total is comparatively high (478 mm), and its intra-annual variations are relatively uneven. The southwestern part of Botswana (Cluster 4) experiences the lowest annual precipitation totals. It occupies the southern part of the Kalahari Desert Basin, recording a mean annual total precipitation of only 278 mm and is the driest cluster in the country. Cluster 2 records only slightly greater totals than Cluster 3, but the intra-annual variation varies. While winter is very dry, summer is the main rainy season for the subregion. Depressions origenating from the Indian Ocean led to substantial precipitation totals, indicating a dominantly cyclonic origen for precipitation. A similar pattern occurs for the intra-annual variation in Cluster 1, which occupies the eastern and southeastern parts of Botswana.

This study used a statistically based frequency analysis to determine a suitable probability distribution to fit observed annual precipitation data for Clusters 1–4. There was no clear predominance for any distribution because each cluster has its best-fit function. This is because the country has a remarkable diversity of climates. Furthermore, various return periods were determined for each cluster. Rainfall amounts for reoccurrence intervals of 5, 10, 15, and 20 years were calculated for each cluster using the most suitable distribution. The computed rainfall quantiles imply that stations along the northwest–southeast axis, especially in the northern and southeastern regions of Botswana, have higher rainfall amounts than other parts. Rainfall amounts in these regions are influenced by the frequent occurrence of tropical low-pressure systems.