Extreme Gradient Boosting
2022-06-20
Introduction
• Extreme Gradient Boosting (XGBoost) is widely regarded as a state-of-the-art algorithm for building predictive models
on real-world datasets.
• Gradient boosting is currently one of the most popular techniques for efficient modeling of tabular
datasets of all sizes.
XGBoost Library
• After the algorithm won a popular machine learning competition on Kaggle, Tianqi Chen and Carlos
Guestrin authored XGBoost: A Scalable Tree Boosting System in 2016 to present their algorithm to the larger
machine learning community.
• As the ML community adopted the algorithm, bindings (functions that tap into the core C++ code)
started appearing in a variety of other languages, including Python, R, Scala, and
Julia.
• The Extreme in Extreme Gradient Boosting means pushing computational limits to the extreme.
Pushing computational limits requires knowledge not just of model building but also of disk
reading, compression, caching, and cores.
The following new design features give XGBoost a big edge in speed over comparable ensemble algorithms:
1. Approximate split-finding algorithm: XGBoost presents an exact greedy algorithm in addition to a
new approximate split-finding algorithm. The approximate algorithm uses quantiles, cut points that
divide the data into equal-sized groups, to propose candidate splits. In a global proposal, the same quantiles are used throughout
the entire training, and in a local proposal, new quantiles are proposed for each round of splitting. (See the
parameter sketch after this list for how the split-finding strategy is selected in practice.)
2. Sparsity-aware split finding: Sparse matrices are designed to store only data points with non-zero and
non-null values, which saves valuable space. Sparsity-aware split finding means that when searching for
splits, XGBoost can skip over missing and zero entries, so it runs faster when the data are sparse.
3. Parallel computing: XGBoost is parallelizable onto GPUs and across networks of computers, making it feasible
to train models on very large datasets on the order of hundreds of millions of training examples.
4. Cache-aware access: A computer's memory is divided into cache and main memory. The cache, which holds
the data used most often, is high-speed memory, and XGBoost uses cache-aware prefetching of gradient
statistics to speed up split finding.
5. Block compression and sharding:
1. Sharding is a method for distributing a single dataset across multiple databases, which can then
be stored on multiple machines. This allows larger datasets to be split into smaller chunks and
stored across multiple data nodes, increasing the total storage capacity of the system.
• Block sharding decreases read times by sharding the data onto multiple disks that alternate
when reading the data.
2. Block compression reduces the cost of computationally expensive disk reading by compressing columns.
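The choice between the exact greedy and the approximate split-finding algorithm, as well as the degree of CPU parallelism, is exposed through training parameters. The snippet below is a minimal sketch using the xgboost R package on hypothetical synthetic data; tree_method and nthread are standard xgboost parameters, but the data and settings here are illustrative only.
library(xgboost)
# Hypothetical synthetic data, purely for illustration
set.seed(1)
X <- matrix(rnorm(500 * 5), ncol = 5)
y <- rnorm(500)
dtrain <- xgb.DMatrix(data = X, label = y)
params <- list(
  tree_method = "hist",  # histogram-based approximate split finding ("exact" = exact greedy)
  max_depth = 3,
  nthread = 2            # number of CPU threads used to build trees in parallel
)
bst <- xgb.train(params = params, data = dtrain, nrounds = 10, verbose = 0)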
Learning Objective
• The learning objective or objective function of a machine learning model determines how well
the model fits the data. When we construct any machine learning model, we do so in the hopes that
it minimizes the loss function across all of the data points we pass in. That’s our ultimate goal, the
smallest possible loss.
• In the case of XGBoost, the learning objective consists of two parts: the loss function and the regularization term:
$$\text{obj}(\theta) = L(\theta) + \Omega(\theta)$$
where L is the training loss function, and Ω is the regularization term. The training loss measures how
predictive our model is with respect to the training data.
A common choice of L is the mean squared error, which is given by:
$$L(\theta) = \sum_i (y_i - \hat{y}_i)^2$$
Another commonly used loss function is logistic loss, to be used for logistic regression:
$$L(\theta) = \sum_i \left[ y_i \ln\left(1 + e^{-\hat{y}_i}\right) + (1 - y_i) \ln\left(1 + e^{\hat{y}_i}\right) \right]$$
The regularization term is what people usually forget to add. The regularization term controls the
complexity of the model, which helps us to avoid overfitting.
For the full mathematical derivation, see the XGBoost documentation.
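In the xgboost R package, the training loss is selected with the objective parameter, and the regularization term is controlled by parameters such as lambda (L2 penalty), alpha (L1 penalty), and gamma (minimum loss reduction to split). The following is a minimal sketch on hypothetical synthetic data; the parameter names are standard xgboost parameters, but the values and data are illustrative only.
library(xgboost)
# Hypothetical synthetic regression data, purely for illustration
set.seed(1)
X <- matrix(rnorm(200 * 4), ncol = 4)
y <- rnorm(200)
dtrain <- xgb.DMatrix(data = X, label = y)
params <- list(
  objective = "reg:squarederror",  # squared-error training loss L(theta)
  lambda = 1,                      # L2 penalty on leaf weights (part of Omega)
  alpha = 0,                       # L1 penalty on leaf weights
  gamma = 0.1,                     # minimum loss reduction required to make a split
  max_depth = 3
)
bst <- xgb.train(params = params, data = dtrain, nrounds = 20, verbose = 0)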
Step 1: Load the Necessary Packages
#install.packages(c("xgboost", "caret"))
library(xgboost)
library(caret)
library(MASS)
Step 2: Load the Data
For this example we’ll fit a boosted regression model to the Boston dataset from the MASS package.
This dataset contains 13 predictor variables that we’ll use to predict one response variable called medv, which
represents the median value of homes (in $1000s) in different census tracts around Boston.
We can see that the dataset contains 506 observations and 14 total variables.
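The code chunk that loads the data and builds the train/test split does not appear in the text above, but the later chunks rely on train_X, train_y, test_X, and test_y. A sketch consistent with those objects is shown below; the 80/20 split and the seed are assumptions, not taken from the original.
library(MASS)
library(caret)
data(Boston)
dim(Boston)  # 506 observations, 14 variables
# Assumed 80/20 train/test split (the exact split used above is not shown)
set.seed(0)
parts <- createDataPartition(Boston$medv, p = 0.8, list = FALSE)
train <- Boston[parts, ]
test  <- Boston[-parts, ]
# Separate the predictors from the response, medv
train_X <- data.matrix(train[, -which(names(train) == "medv")])
train_y <- train$medv
test_X  <- data.matrix(test[, -which(names(test) == "medv")])
test_y  <- test$medv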
#define final training and testing sets
xgb_train <- xgb.DMatrix(data = train_X, label = train_y)
xgb_test <- xgb.DMatrix(data = test_X, label = test_y)
#define watchlist
watchlist <- list(train = xgb_train, test = xgb_test)
#fit XGBoost model and display training and testing data at each round
model <- xgb.train(data = xgb_train, max.depth = 3, watchlist = watchlist, nrounds = 70)
## [31] train-rmse:1.656522 test-rmse:3.540007
## [32] train-rmse:1.645577 test-rmse:3.545167
## [33] train-rmse:1.614263 test-rmse:3.546557
## [34] train-rmse:1.593455 test-rmse:3.539367
## [35] train-rmse:1.567166 test-rmse:3.528534
## [36] train-rmse:1.555617 test-rmse:3.532916
## [37] train-rmse:1.527377 test-rmse:3.540154
## [38] train-rmse:1.517858 test-rmse:3.525271
## [39] train-rmse:1.492600 test-rmse:3.502901
## [40] train-rmse:1.484267 test-rmse:3.523625
## [41] train-rmse:1.468374 test-rmse:3.530097
## [42] train-rmse:1.442361 test-rmse:3.532107
## [43] train-rmse:1.410727 test-rmse:3.527861
## [44] train-rmse:1.402039 test-rmse:3.526187
## [45] train-rmse:1.380398 test-rmse:3.518224
## [46] train-rmse:1.372660 test-rmse:3.518499
## [47] train-rmse:1.349876 test-rmse:3.514045
## [48] train-rmse:1.345266 test-rmse:3.509992
## [49] train-rmse:1.319305 test-rmse:3.503536
## [50] train-rmse:1.296536 test-rmse:3.484637
## [51] train-rmse:1.284502 test-rmse:3.484824
## [52] train-rmse:1.259046 test-rmse:3.491340
## [53] train-rmse:1.253510 test-rmse:3.494199
## [54] train-rmse:1.246855 test-rmse:3.502560
## [55] train-rmse:1.228881 test-rmse:3.501782
## [56] train-rmse:1.218618 test-rmse:3.514161
## [57] train-rmse:1.193581 test-rmse:3.515703
## [58] train-rmse:1.181366 test-rmse:3.516584
## [59] train-rmse:1.161258 test-rmse:3.523906
## [60] train-rmse:1.137540 test-rmse:3.530229
## [61] train-rmse:1.121475 test-rmse:3.519965
## [62] train-rmse:1.111235 test-rmse:3.523945
## [63] train-rmse:1.087250 test-rmse:3.521362
## [64] train-rmse:1.076125 test-rmse:3.512412
## [65] train-rmse:1.058271 test-rmse:3.521056
## [66] train-rmse:1.054615 test-rmse:3.519675
## [67] train-rmse:1.043884 test-rmse:3.518975
## [68] train-rmse:1.034688 test-rmse:3.522976
## [69] train-rmse:1.022151 test-rmse:3.518511
## [70] train-rmse:1.009072 test-rmse:3.507155
From the output we can see that the minimum testing RMSE is achieved at 50 rounds. Beyond this point,
the test RMSE actually begins to increase, which is a sign that we’re overfitting the training data.
Thus, we’ll define our final XGBoost model to use 50 rounds:
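The chunk that fits this final model is not captured in the text above; one way to define it that would produce a per-round training log like the one below (reporting only train-rmse) is sketched here, with the exact arguments assumed rather than taken from the original.
#fit final XGBoost model using 50 rounds (sketch; the original call is not shown)
final <- xgb.train(data = xgb_train, max.depth = 3,
                   watchlist = list(train = xgb_train), nrounds = 50)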
## [1] train-rmse:10.268952
## [2] train-rmse:7.621959
## [3] train-rmse:5.802636
## [4] train-rmse:4.550875
## [5] train-rmse:3.731272
## [6] train-rmse:3.212170
## [7] train-rmse:2.876153
## [8] train-rmse:2.679817
## [9] train-rmse:2.491618
## [10] train-rmse:2.375090
## [11] train-rmse:2.313203
## [12] train-rmse:2.211834
## [13] train-rmse:2.174607
## [14] train-rmse:2.139551
## [15] train-rmse:2.114976
## [16] train-rmse:2.068074
## [17] train-rmse:2.051490
## [18] train-rmse:1.995807
## [19] train-rmse:1.970252
## [20] train-rmse:1.934135
## [21] train-rmse:1.900306
## [22] train-rmse:1.871715
## [23] train-rmse:1.826952
## [24] train-rmse:1.811795
## [25] train-rmse:1.801999
## [26] train-rmse:1.779351
## [27] train-rmse:1.754780
## [28] train-rmse:1.746118
## [29] train-rmse:1.705051
## [30] train-rmse:1.683708
## [31] train-rmse:1.656522
## [32] train-rmse:1.645577
## [33] train-rmse:1.614263
## [34] train-rmse:1.593455
## [35] train-rmse:1.567166
## [36] train-rmse:1.555617
## [37] train-rmse:1.527377
## [38] train-rmse:1.517858
## [39] train-rmse:1.492600
## [40] train-rmse:1.484267
## [41] train-rmse:1.468374
## [42] train-rmse:1.442361
## [43] train-rmse:1.410727
## [44] train-rmse:1.402039
## [45] train-rmse:1.380398
## [46] train-rmse:1.372660
## [47] train-rmse:1.349876
## [48] train-rmse:1.345266
## [49] train-rmse:1.319305
## [50] train-rmse:1.296536
pred_y <- stats::predict(final, xgb_test)
MSE <- mean((test_y - pred_y)^2) #mse
MAE <- MAE(test_y, pred_y) #mae
RMSE <- RMSE(test_y, pred_y) #rmse
MSE
## [1] 12.1427
MAE
## [1] 2.513421
RMSE
## [1] 3.484637
The root mean squared error turns out to be 3.484637. This represents the average difference between the
predicted median house values and the actual observed house values in the test set, measured in $1000s.