Understanding XGBoost Model On Otto Dataset
Michaël Benesty
Introduction
XGBoost is an implementation of the famous gradient boosting algorithm. This model is often described as a black box, meaning it works well but it is not trivial to understand how. Indeed, the model is made of hundreds (thousands?) of decision trees. You may wonder how a human could possibly get a general view of the model.
XGBoost is known for its speed and accurate predictive power, but it also comes with various functions to help you understand the model. The purpose of this R Markdown document is to demonstrate how we can leverage the functions already implemented in the XGBoost R package for that purpose. Of course, everything shown below can be applied to the dataset you may have to manipulate at work or wherever!
First we will train a model on the OTTO dataset, then we will generate two visualisations to get a clue of what is important to the model, and finally we will see how we can leverage this information.
Preparation of the data
This part is based on the tutorial example by Tong He (https://github.com/dmlc/xgboost/blob/master/demo/kaggle-otto/otto_train_pred.R).
First, let's load the packages and the dataset.
require(xgboost)
## Loading required package: xgboost
require(methods)
## Loading required package: methods
require(data.table)
## Loading required package: data.table
require(magrittr)
## Loading required package: magrittr
train <- fread('../input/train.csv', header = TRUE, stringsAsFactors = FALSE)
test <- fread('../input/test.csv', header = TRUE, stringsAsFactors = FALSE)
Let's see what is in this dataset.
# Train dataset dimensions
dim(train)
## [1] 61878    95
# Training content
train[1:6, 1:5, with = F]
##    id feat_1 feat_2 feat_3 feat_4
## 1:  1      1      0      0      0
## 2:  2      0      0      0      0
## 3:  3      0      0      0      0
## 4:  4      1      0      0      1
## 5:  5      0      0      0      0
## 6:  6      2      1      0      0
# Test dataset dimensions
dim(test)
## [1] 144368     94
# Test content
test[1:6, 1:5, with = F]
##    id feat_1 feat_2 feat_3 feat_4
## 1:  1      0      0      0      0
## 2:  2      2      2     14     16
## 3:  3      0      1     12      1
## 4:  4      0      0      0      1
## 5:  5      1      0      0      1
## 6:  6      0      0      0      0
We only display the first 6 rows and first 5 columns for convenience.
Each column represents a feature measured by an integer. Each row is a product.
Obviously the first column (id) doesn't contain any useful information. To let the algorithm focus on real stuff, we will delete the column.
# Delete ID column in training dataset
train[, id := NULL]
# Delete ID column in testing dataset
test[, id := NULL]
# Check the content of the last column
train[1:6, ncol(train), with = F]
##     target
## 1: Class_1
## 2: Class_1
## 3: Class_1
## 4: Class_1
## 5: Class_1
## 6: Class_1
# Save the name of the last column
nameLastCol <- names(train)[ncol(train)]
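The classes are stored as character strings in the last column, but XGBoost expects numeric class labels starting at 0. For that purpose, we will:
extract the target column
remove Class_ from each class name
convert to integers
subtract 1 from the new value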
# Convert classes to numbers
y <- train[, nameLastCol, with = F][[1]] %>% gsub('Class_', '', .) %>% {as.integer(.) - 1}
# Display the first 5 levels
y[1:5]
## [1] 0 0 0 0 0
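The four steps above can be sketched in base R on a small made-up vector. The `target` values below are illustrative and not taken from the dataset; the point is only to show that the resulting labels are integers starting at 0.

```r
# Hypothetical class names in the same "Class_<n>" format as the target column
target <- c("Class_1", "Class_3", "Class_9", "Class_1")

# Strip the prefix, convert to integer, and shift so the labels start at 0
labels <- as.integer(gsub("Class_", "", target, fixed = TRUE)) - 1L

# Number of classes implied by the largest label
num_class <- max(labels) + 1L

print(labels)
print(num_class)
```

A quick check such as `all(labels >= 0)` before training can catch a forgotten shift early.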
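We remove the label column from the training dataset, otherwise XGBoost would use it to guess the labels!
# Parentheses make data.table read nameLastCol as a variable holding the column name
train[, (nameLastCol) := NULL]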
data.table is an awesome implementation of data.frame; unfortunately, it is not a format supported natively by XGBoost. We need to convert both datasets (training and test) to numeric matrix format.
trainMatrix <- train[, lapply(.SD, as.numeric)] %>% as.matrix
testMatrix <- test[, lapply(.SD, as.numeric)] %>% as.matrix
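As a small illustration of what the conversion produces, here is a base R sketch on a made-up data.frame (the `df` object and its values are hypothetical). It shows that integer columns end up in a matrix stored as doubles, with the column names preserved.

```r
# Hypothetical integer features, standing in for the real table
df <- data.frame(feat_1 = c(1L, 0L, 2L), feat_2 = c(0L, 0L, 1L))

# as.matrix keeps integer storage when every column is integer,
# so the storage mode is switched to double explicitly
m <- as.matrix(df)
storage.mode(m) <- "double"

print(m)
print(is.double(m))
```

The column names matter later: they are what allows feature indices to be mapped back to readable names.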
https://kaggle2.blob.core.windows.net/forummessageattachments/76715/2435/Understanding%20XGBoost%20Model%20on%20Otto%20Dataset.html?sv=20 1/4
11/28/2016 UnderstandingXGBoostModelonOttoDataset
Model training
Before training, we will use cross-validation to evaluate our error rate.
Look at the function documentation for more information.
numberOfClasses <- max(y) + 1
param <- list("objective" = "multi:softprob",
              "eval_metric" = "mlogloss",
              "num_class" = numberOfClasses)
cv.nround <- 5
cv.nfold <- 3
bst.cv = xgb.cv(params = param, data = trainMatrix, label = y,
                nfold = cv.nfold, nrounds = cv.nround)
## [0] train-mlogloss:1.539950+0.003540 test-mlogloss:1.555506+0.001696
## [1] train-mlogloss:1.280621+0.002441 test-mlogloss:1.304418+0.001335
## [2] train-mlogloss:1.111787+0.003201 test-mlogloss:1.142924+0.002505
## [3] train-mlogloss:0.991269+0.003233 test-mlogloss:1.029022+0.002207
## [4] train-mlogloss:0.899486+0.003829 test-mlogloss:0.942855+0.002007
As we can see, the error (here the multiclass log loss) on the held-out folds is already fairly low for a model trained for about 5 minutes.
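To help read the mlogloss values, here is a minimal base R sketch of the multiclass log loss. The `mlogloss` helper and the probability matrix are written for illustration only (they are not part of the xgboost package), and the 9-class baseline assumes the 9 product classes of the Otto challenge.

```r
# Multiclass log loss: average negative log-probability of the true class.
# prob:  n x k matrix of predicted class probabilities
# label: 0-based integer class ids, one per row of prob
mlogloss <- function(prob, label, eps = 1e-15) {
  p <- prob[cbind(seq_along(label), label + 1L)]  # probability given to the true class
  p <- pmin(pmax(p, eps), 1 - eps)                # clip to avoid log(0)
  -mean(log(p))
}

# Made-up predictions for three observations and three classes
prob <- matrix(c(0.7, 0.2, 0.1,
                 0.1, 0.8, 0.1,
                 0.3, 0.3, 0.4), ncol = 3, byrow = TRUE)
label <- c(0L, 1L, 2L)
loss <- mlogloss(prob, label)

# Baseline: equal probability for each of 9 classes
uniform_loss <- mlogloss(matrix(1 / 9, nrow = 3, ncol = 9), label)

print(loss)
print(uniform_loss)
```

A model that assigns equal probability to all 9 classes scores log(9), roughly 2.197, which gives a baseline for reading the cross-validation values above.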
Finally, we are ready to train the real model!
nround = 50
bst = xgboost(params = param, data = trainMatrix, label = y, nrounds = nround)
## [0] train-mlogloss:1.539929
## [1] train-mlogloss:1.284352
## [2] train-mlogloss:1.116242
## [3] train-mlogloss:0.997410
## [4] train-mlogloss:0.908786
## [5] train-mlogloss:0.837502
## [6] train-mlogloss:0.780620
## [7] train-mlogloss:0.735472
## [8] train-mlogloss:0.696930
## [9] train-mlogloss:0.666730
## [10] train-mlogloss:0.641023
## [11] train-mlogloss:0.618734
## [12] train-mlogloss:0.599407
## [13] train-mlogloss:0.583202
## [14] train-mlogloss:0.568400
## [15] train-mlogloss:0.555463
## [16] train-mlogloss:0.543348
## [17] train-mlogloss:0.532382
## [18] train-mlogloss:0.522701
## [19] train-mlogloss:0.513794
## [20] train-mlogloss:0.506249
## [21] train-mlogloss:0.497970
## [22] train-mlogloss:0.491400
## [23] train-mlogloss:0.484099
## [24] train-mlogloss:0.477010
## [25] train-mlogloss:0.470935
## [26] train-mlogloss:0.466101
## [27] train-mlogloss:0.461392
## [28] train-mlogloss:0.456607
## [29] train-mlogloss:0.450932
## [30] train-mlogloss:0.446368
## [31] train-mlogloss:0.442488
## [32] train-mlogloss:0.437648
## [33] train-mlogloss:0.433682
## [34] train-mlogloss:0.428969
## [35] train-mlogloss:0.424687
## [36] train-mlogloss:0.421398
## [37] train-mlogloss:0.418917
## [38] train-mlogloss:0.415504
## [39] train-mlogloss:0.411823
## [40] train-mlogloss:0.407470
## [41] train-mlogloss:0.404227
## [42] train-mlogloss:0.401174
## [43] train-mlogloss:0.397705
## [44] train-mlogloss:0.394443
## [45] train-mlogloss:0.392279
## [46] train-mlogloss:0.389940
## [47] train-mlogloss:0.387887
## [48] train-mlogloss:0.385097
## [49] train-mlogloss:0.382814
Model understanding
Feature importance
So far, we have built a model made of nround rounds of boosted trees.
To build a tree, the dataset is divided recursively several times. At the end of the process, you get groups of observations (here, these observations are properties regarding OTTO products).
Each division operation is called a split.
Each group at each division level is called a branch and the deepest level is called a leaf.
In the final model, these leaves are supposed to be as pure as possible for each tree, meaning in our case that each leaf should be made of one class of OTTO product only (of course this is not true in practice, but that is what we try to achieve in a minimum of splits).
Not all splits are equally important. Basically the first split of a tree will have more impact on the purity than, for instance, the deepest split. Intuitively, we understand that the first split does most of the work, and the following splits focus on smaller parts of the dataset which have been misclassified by the earlier splits.
In the same way, in boosting we try to reduce the misclassification at each round (it is called the loss). So the first tree will do most of the work and the following trees will focus on the remainder, on the parts not correctly learned by the previous trees.
The improvement brought by each split can be measured; it is the gain.
Each split is done on one feature only, at one value.
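To give an intuition for what a gain measures, here is a minimal base R sketch. It follows the split-scoring formula from the XGBoost paper, 1/2 [G_L^2/(H_L + lambda) + G_R^2/(H_R + lambda) - (G_L + G_R)^2/(H_L + H_R + lambda)] - gamma, where G and H are sums of first and second derivatives of the loss. The `split_gain` helper, the toy data and the squared-error setup are assumptions made for illustration; the values printed by the library may differ by constant factors.

```r
# Gain of a candidate split, given per-observation gradients g and hessians h.
# go_left is a logical vector saying which observations fall in the left branch.
split_gain <- function(g, h, go_left, lambda = 1, gamma = 0) {
  score <- function(G, H) G^2 / (H + lambda)
  GL <- sum(g[go_left]);  HL <- sum(h[go_left])
  GR <- sum(g[!go_left]); HR <- sum(h[!go_left])
  0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma
}

# Toy data: one feature x that separates the two groups of targets
x <- c(1, 2, 3, 10, 11, 12)
y <- c(0, 0, 0, 1, 1, 1)

# Squared-error loss with a constant starting prediction: g = pred - y, h = 1
pred <- rep(0.5, length(y))
g <- pred - y
h <- rep(1, length(y))

gain_good <- split_gain(g, h, x < 5)    # separates the two groups
gain_bad  <- split_gain(g, h, x < 1.5)  # isolates a single observation

print(gain_good)
print(gain_bad)
```

The split at `x < 5` gets a much larger gain than the split at `x < 1.5`, which matches the intuition that a split is valuable when it separates observations the current model gets wrong in opposite directions.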
Let's see what the model looks like.
model <- xgb.dump(bst, with.stats = T)
model[1:10]
## [1] "booster[0]"
## [2] "0:[f16<1.5] yes=1,no=2,missing=1,gain=309.719,cover=12222.8"
## [3] "1:[f29<26.5] yes=3,no=4,missing=3,gain=161.964,cover=11424"
## [4] "3:[f77<2.5] yes=7,no=8,missing=7,gain=106.092,cover=11416.3"
## [5] "7:[f52<12.5] yes=13,no=14,missing=13,gain=43.1389,cover=11211.9"
## [6] "13:[f76<1.5] yes=25,no=26,missing=25,gain=37.407,cover=11143.5"
## [7] "25:[f16<2.00001] yes=49,no=50,missing=50,gain=36.3329,cover=10952.1"
## [8] "49:leaf=0.0905567,cover=1090.77"
## [9] "50:leaf=0.148413,cover=9861.33"
## [10] "26:[f83<26] yes=51,no=52,missing=52,gain=167.766,cover=191.407"
For convenience, we are displaying the first 10 lines of the model only.
Clearly, it is not easy to understand what it means.
Basically each line represents a branch: there is the node ID, the feature ID, the point where it splits, and information regarding the next branches (left, right, and where to go when the value for this feature is missing).
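To make the structure of a dump line concrete, here is a base R sketch that parses split lines into a table. The regular expression is an assumption based on the lines printed above, not an official grammar of the dump format, and the `feature_names` vector is hypothetical (it assumes 93 columns named feat_1 to feat_93).

```r
# Three lines copied from the dump above: two splits and one leaf
dump_lines <- c(
  "0:[f16<1.5] yes=1,no=2,missing=1,gain=309.719,cover=12222.8",
  "1:[f29<26.5] yes=3,no=4,missing=3,gain=161.964,cover=11424",
  "49:leaf=0.0905567,cover=1090.77"
)

# node:[feature<threshold] yes=..,no=..,missing=..,gain=..,cover=..
pattern <- paste0("^([0-9]+):\\[(f[0-9]+)<([^]]+)\\] ?yes=([0-9]+),no=([0-9]+),",
                  "missing=([0-9]+),gain=([^,]+),cover=(.+)$")

# Leaf lines do not match the pattern and are skipped
is_split <- grepl(pattern, dump_lines)
matches <- regmatches(dump_lines[is_split], regexec(pattern, dump_lines[is_split]))

splits <- do.call(rbind, lapply(matches, function(x) data.frame(
  node = as.integer(x[2]), feature = x[3], threshold = as.numeric(x[4]),
  yes = as.integer(x[5]), no = as.integer(x[6]), missing = as.integer(x[7]),
  gain = as.numeric(x[8]), cover = as.numeric(x[9]),
  stringsAsFactors = FALSE)))

# Feature ids in the dump are zero-based, so f16 is the 17th column
feature_names <- paste0("feat_", 1:93)
splits$name <- feature_names[as.integer(sub("f", "", splits$feature)) + 1L]

print(splits)
```

Under that naming assumption, the root split on f16 maps to feat_17.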
Fortunately, XGBoost offers a better representation: feature importance.
Feature importance is about averaging the gain of each feature for all splits and all trees.
Then we can use the function xgb.plot.importance.
# Get the feature real names
names <- dimnames(trainMatrix)[[2]]
# Compute feature importance matrix
importance_matrix <- xgb.importance(names, model = bst)
# Nice graph
xgb.plot.importance(importance_matrix[1:10,])
To make the plot understandable, we first extracted the column names from the matrix.
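To illustrate the idea of aggregating gain per feature, here is a base R sketch using the seven split lines visible in the dump shown earlier. It sums the gain of each feature and divides by the total, which is one common way to express importance as a share. This is an assumption for illustration and not necessarily the exact computation done by xgb.importance.

```r
# Feature id and gain of the split lines shown in the model dump
splits <- data.frame(
  feature = c("f16", "f29", "f77", "f52", "f76", "f16", "f83"),
  gain = c(309.719, 161.964, 106.092, 43.1389, 37.407, 36.3329, 167.766),
  stringsAsFactors = FALSE
)

# Total gain per feature, then normalised so the shares sum to 1
total_gain <- tapply(splits$gain, splits$feature, sum)
importance <- sort(total_gain / sum(total_gain), decreasing = TRUE)

print(round(importance, 3))
```

On these few lines, f16 comes out first because it is used in two splits, including the root. A real importance table would aggregate over every split of every tree.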
Interpretation
In the feature importance plot above, we can see the 10 most important features.
This function gives a color to each bar. Basically a k-means clustering is applied to group the features by importance.
From here you can take several actions. For instance you can remove the less important features (feature selection process), or go deeper into the interaction between the most important features and the labels.
Or you can just reason about why these features are so important (in the OTTO challenge we can't go this way because there is not enough information).
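As an example of the feature selection idea, here is a base R sketch that keeps only the most important columns of a matrix. The `importance_demo` table and `matrix_demo` values are made up, and the cut-off of 3 features is arbitrary. The only thing assumed about a real importance table is that it has `Feature` and `Gain` columns.

```r
# Made-up importance table with the two columns used here
importance_demo <- data.frame(
  Feature = c("feat_3", "feat_1", "feat_5", "feat_2", "feat_4"),
  Gain = c(0.45, 0.30, 0.15, 0.07, 0.03),
  stringsAsFactors = FALSE
)

# Made-up feature matrix with matching column names
matrix_demo <- matrix(0:19, nrow = 4,
                      dimnames = list(NULL, paste0("feat_", 1:5)))

# Keep the n features with the largest gain (n is an arbitrary choice)
n_keep <- 3
ordered_features <- importance_demo$Feature[order(-importance_demo$Gain)]
top_features <- head(ordered_features, n_keep)

# drop = FALSE keeps the result a matrix even when a single column remains
reduced <- matrix_demo[, top_features, drop = FALSE]

print(colnames(reduced))
```

Whether dropping features helps is an empirical question, so it is worth re-running the cross-validation on the reduced matrix before committing to it.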
Tree graph
Feature importance gives you feature weight information but not the interactions between features.
The XGBoost R package has another useful function for that.
xgb.plot.tree(feature_names = names, model = bst, n_first_tree = 2)
[Figure: plot of the first two trees produced by xgb.plot.tree. Each node shows the split feature with its Cover and Gain, each edge shows the split condition (for example < 1.5 or >= 1.5), and terminal nodes are labelled Leaf. One root splits on feat_17 (Cover: 12222.8, Gain: 309.719), the other on feat_14 (Cover: 12222.8, Gain: 8073.58).]
We are just displaying the first two trees here.
On simple models the first two trees may be enough. Here, it might not be the case. We can see from the size of the trees that the interaction between features is complicated. Besides, XGBoost generates k trees at each round for a k-class classification problem. Therefore the two trees illustrated here are trying to classify data into different classes.
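To see which class a given tree works on, here is a small base R sketch. It assumes that trees are stored round by round, with one tree per class in class order, and that there are 9 classes as in the Otto challenge. Both points are assumptions to check against the model dump.

```r
# Assumed layout: tree index = round * num_class + class (all zero-based)
num_class <- 9
tree_index <- 0:19

tree_info <- data.frame(
  tree = tree_index,
  round = tree_index %/% num_class,  # boosting round the tree belongs to
  class = tree_index %% num_class    # class whose score the tree contributes to
)

print(head(tree_info, 11))
```

Under this assumption, the first two trees (indices 0 and 1) belong to round 0 and work on classes 0 and 1, while the second tree for class 0 only appears at index 9.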