Understanding XGBoost Model On Otto Dataset


Michal Benesty

Introduction
XGBoost is an implementation of the famous gradient boosting algorithm. This model is often described as a black box, meaning it works well but it is not trivial to understand how. Indeed, the model is made of hundreds (thousands?) of decision trees. You may wonder how a human could possibly have a general view of such a model.

While XGBoost is known for its speed and accurate predictive power, it also comes with various functions to help you understand the model. The purpose of this R Markdown document is to demonstrate how we can leverage the functions already implemented in the XGBoost R package for that purpose. Of course, everything shown below can be applied to the dataset you may have to manipulate at work or wherever!

First we will train a model on the OTTO dataset, then we will generate two visualizations to get a clue of what is important to the model, and finally we will see how we can leverage this information.

Preparation of the data
This part is based on the tutorial example by Tong He (https://github.com/dmlc/xgboost/blob/master/demo/kaggle-otto/otto_train_pred.R).

First, let's load the packages and the dataset.

require(xgboost)

## Loading required package: xgboost

require(methods)

## Loading required package: methods

require(data.table)

## Loading required package: data.table

require(magrittr)

## Loading required package: magrittr

train <- fread('../input/train.csv', header = T, stringsAsFactors = F)
test <- fread('../input/test.csv', header = TRUE, stringsAsFactors = F)

magrittr and data.table are here to make the code cleaner and faster.
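
If the %>% pipe from magrittr is new to you, here is a tiny standalone illustration (not part of the original analysis): x %>% f(y) is just another way of writing f(x, y).

# The two expressions below are equivalent; the pipe passes the left-hand side
# as the first argument of the function on the right-hand side
sqrt(sum(1:10))
1:10 %>% sum %>% sqrt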

Let's see what is in this dataset.

# Train dataset dimensions
dim(train)

## [1] 61878    95

# Training content
train[1:6, 1:5, with = F]

##    id feat_1 feat_2 feat_3 feat_4
## 1:  1      1      0      0      0
## 2:  2      0      0      0      0
## 3:  3      0      0      0      0
## 4:  4      1      0      0      1
## 5:  5      0      0      0      0
## 6:  6      2      1      0      0

# Test dataset dimensions
dim(test)

## [1] 144368     94

# Test content
test[1:6, 1:5, with = F]

##    id feat_1 feat_2 feat_3 feat_4
## 1:  1      0      0      0      0
## 2:  2      2      2     14     16
## 3:  3      0      1     12      1
## 4:  4      0      0      0      1
## 5:  5      1      0      0      1
## 6:  6      0      0      0      0

We only display the first 6 rows and the first 5 columns for convenience.

Each column represents a feature measured by an integer. Each row is a product.

Obviously the first column (id) doesn't contain any useful information. To let the algorithm focus on the real stuff, we will delete this column.

# Delete ID column in training dataset
train[, id := NULL]

# Delete ID column in testing dataset
test[, id := NULL]

According to the OTTO challenge description, we have here a multiclass classification challenge. We need to extract the labels (here the names of the different classes) from the dataset. We only have two files (test and training), so it seems logical that the training file contains the classes we are looking for. Usually the labels are in the first or the last column. Let's check the content of the last column.

# Check the content of the last column
train[1:6, ncol(train), with = F]

##     target
## 1: Class_1
## 2: Class_1
## 3: Class_1
## 4: Class_1
## 5: Class_1
## 6: Class_1
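
As a quick optional check (not in the original write-up), data.table makes it easy to count how many products fall into each class; target is the column name we just displayed.

# Class distribution of the label column (optional sanity check)
train[, .N, by = target]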

# Save the name of the last column
nameLastCol <- names(train)[ncol(train)]

The classes are provided as character strings in the ncol(train)-th column, called nameLastCol. As you may know, XGBoost doesn't support anything other than numbers, so we will convert the classes to integers. Moreover, according to the documentation, they should start at 0.

For that purpose, we will:

extract the target column
remove Class_ from each class name
convert to integers
subtract 1 from the new value

# Convert classes to numbers
y <- train[, nameLastCol, with = F][[1]] %>% gsub('Class_', '', .) %>% {as.integer(.) - 1}
# Display the first 5 levels
y[1:5]

## [1] 0 0 0 0 0

We remove the label column from the training dataset, otherwise XGBoost would use it to guess the labels!!!

train[, nameLastCol := NULL, with = F]
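
As a side note, more recent versions of data.table prefer wrapping the variable holding the column name in parentheses rather than using with = F for this kind of deletion; a hedged equivalent of the line above would be:

# Equivalent deletion with the newer data.table idiom (nameLastCol holds the column name)
# train[, (nameLastCol) := NULL]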

data.table is an awesome implementation of data.frame, unfortunately it is not a format natively supported by XGBoost. We need to convert both datasets (training and test) to numeric Matrix format.

trainMatrix <- train[, lapply(.SD, as.numeric)] %>% as.matrix
testMatrix <- test[, lapply(.SD, as.numeric)] %>% as.matrix
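
Another possibility, which the rest of this document does not depend on, is to wrap the feature matrix and the labels into an xgb.DMatrix, XGBoost's own data structure; a minimal sketch:

# Optional: bundle features and labels into XGBoost's internal data structure
dtrain <- xgb.DMatrix(data = trainMatrix, label = y)
# xgb.cv() and xgb.train() accept such an object directly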

Model training
Before training the model, we will use cross-validation to evaluate our error rate.

Basically XGBoost will divide the training data in nfold parts, then retain the first part and use it as the test data. Then it will reintegrate the first part into the training dataset and retain the second part, do a new training, and so on.

Look at the function documentation for more information.

numberOfClasses <- max(y) + 1

param <- list("objective" = "multi:softprob",
              "eval_metric" = "mlogloss",
              "num_class" = numberOfClasses)

cv.nround <- 5
cv.nfold <- 3

bst.cv = xgb.cv(param = param, data = trainMatrix, label = y,
                nfold = cv.nfold, nrounds = cv.nround)

## [0]  train-mlogloss:1.539950+0.003540  test-mlogloss:1.555506+0.001696
## [1]  train-mlogloss:1.280621+0.002441  test-mlogloss:1.304418+0.001335
## [2]  train-mlogloss:1.111787+0.003201  test-mlogloss:1.142924+0.002505
## [3]  train-mlogloss:0.991269+0.003233  test-mlogloss:1.029022+0.002207
## [4]  train-mlogloss:0.899486+0.003829  test-mlogloss:0.942855+0.002007

As we can see, the error rate is low on the test dataset (for a model trained in about 5 minutes).
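
If you want to pick nrounds from these numbers programmatically, the object returned by xgb.cv holds the per-round metrics; the exact shape depends on the xgboost version (older releases return a data.table, newer ones a list with an evaluation_log element), so the sketch below is hedged accordingly.

# Inspect the cross-validation history (structure depends on the xgboost version)
if (is.data.frame(bst.cv)) {
  print(bst.cv)                    # older versions: a data.table of train/test mlogloss per round
} else {
  print(bst.cv$evaluation_log)     # newer versions: history stored in evaluation_log
}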

Finally, we are ready to train the real model!!!

nround = 50
bst = xgboost(param = param, data = trainMatrix, label = y, nrounds = nround)

## [0]  train-mlogloss:1.539929
## [1]  train-mlogloss:1.284352
## [2]  train-mlogloss:1.116242
## [3]  train-mlogloss:0.997410
## [4]  train-mlogloss:0.908786
## [5]  train-mlogloss:0.837502
## [6]  train-mlogloss:0.780620
## [7]  train-mlogloss:0.735472
## [8]  train-mlogloss:0.696930
## [9]  train-mlogloss:0.666730
## [10] train-mlogloss:0.641023
## [11] train-mlogloss:0.618734
## [12] train-mlogloss:0.599407
## [13] train-mlogloss:0.583202
## [14] train-mlogloss:0.568400
## [15] train-mlogloss:0.555463
## [16] train-mlogloss:0.543348
## [17] train-mlogloss:0.532382
## [18] train-mlogloss:0.522701
## [19] train-mlogloss:0.513794
## [20] train-mlogloss:0.506249
## [21] train-mlogloss:0.497970
## [22] train-mlogloss:0.491400
## [23] train-mlogloss:0.484099
## [24] train-mlogloss:0.477010
## [25] train-mlogloss:0.470935
## [26] train-mlogloss:0.466101
## [27] train-mlogloss:0.461392
## [28] train-mlogloss:0.456607
## [29] train-mlogloss:0.450932
## [30] train-mlogloss:0.446368
## [31] train-mlogloss:0.442488
## [32] train-mlogloss:0.437648
## [33] train-mlogloss:0.433682
## [34] train-mlogloss:0.428969
## [35] train-mlogloss:0.424687
## [36] train-mlogloss:0.421398
## [37] train-mlogloss:0.418917
## [38] train-mlogloss:0.415504
## [39] train-mlogloss:0.411823
## [40] train-mlogloss:0.407470
## [41] train-mlogloss:0.404227
## [42] train-mlogloss:0.401174
## [43] train-mlogloss:0.397705
## [44] train-mlogloss:0.394443
## [45] train-mlogloss:0.392279
## [46] train-mlogloss:0.389940
## [47] train-mlogloss:0.387887
## [48] train-mlogloss:0.385097
## [49] train-mlogloss:0.382814

Model understanding
Feature importance
So far, we have built a model made of nround trees.

To build a tree, the dataset is divided recursively several times. At the end of the process you get groups of observations (here, these observations are properties of OTTO products).

Each division operation is called a split.

Each group at each division level is called a branch and the deepest level is called a leaf.

In the final model, these leaves are supposed to be as pure as possible for each tree, meaning in our case that each leaf should be made of one class of OTTO product only (of course it is not true, but that's what we try to achieve in a minimum of splits).

Not all splits are equally important. Basically the first split of a tree will have more impact on the purity than, for instance, the deepest split. Intuitively, we understand that the first split makes most of the work, and the following splits focus on the smaller parts of the dataset which have been misclassified by the first split.

In the same way, in boosting we try to optimize the misclassification at each round (it is called the loss). So the first tree will do the big work and the following trees will focus on the remaining, on the parts not correctly learned by the previous trees.

The improvement brought by each split can be measured; it is the gain.

Each split is done on one feature only, at one value.

Let's see what the model looks like.


model <- xgb.dump(bst, with.stats = T)
model[1:10]

##[1]"booster[0]"
##[2]"0:[f16<1.5]yes=1,no=2,missing=1,gain=309.719,cover=12222.8"
##[3]"1:[f29<26.5]yes=3,no=4,missing=3,gain=161.964,cover=11424"
##[4]"3:[f77<2.5]yes=7,no=8,missing=7,gain=106.092,cover=11416.3"
##[5]"7:[f52<12.5]yes=13,no=14,missing=13,gain=43.1389,cover=11211.9"
##[6]"13:[f76<1.5]yes=25,no=26,missing=25,gain=37.407,cover=11143.5"
##[7]"25:[f16<2.00001]yes=49,no=50,missing=50,gain=36.3329,cover=10952.1"
##[8]"49:leaf=0.0905567,cover=1090.77"
##[9]"50:leaf=0.148413,cover=9861.33"
##[10]"26:[f83<26]yes=51,no=52,missing=52,gain=167.766,cover=191.407"

For convenience, we are displaying only the first 10 lines of the model.

Clearly, it is not easy to understand what it means.

Basically each line represents a branch; there is the tree ID, the feature ID, the point where it splits, and information regarding the next branches (left, right, when the row for this feature is N/A).

Fortunately, XGBoost offers a better representation: feature importance.

Feature importance is about averaging the gain of each feature over all splits and all trees.

Then we can use the function xgb.plot.importance.

# Get the feature real names
names <- dimnames(trainMatrix)[[2]]

# Compute feature importance matrix
importance_matrix <- xgb.importance(names, model = bst)

# Nice graph
xgb.plot.importance(importance_matrix[1:10,])

> To make it understandable, we first extract the column names from the Matrix.

Interpretation
In the feature importance plot above, we can see the 10 most important features.

This function gives a color to each bar. Basically a K-means clustering is applied to group the features by importance.

From here you can take several actions. For instance you can remove the less important features (feature selection process, sketched below), or go deeper in the interaction between the most important features and the labels.

Or you can just reason about why these features are so important (in the OTTO challenge we can't go this way because there is not enough information).
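
To make the feature selection idea concrete, here is a hedged sketch (the cut-off of 20 features and the variable names are arbitrary and not part of the original analysis):

# Keep only the 20 most important features (arbitrary cut-off, for illustration only)
topFeatures <- importance_matrix$Feature[1:20]
trainMatrixTop <- trainMatrix[, topFeatures]

# A new model could then be trained on the reduced matrix, e.g.:
# bst.top <- xgboost(param = param, data = trainMatrixTop, label = y, nrounds = nround)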

Tree graph
Feature importance gives you feature weight information but no interaction between features.

The XGBoost R package has another useful function for that.

xgb.plot.tree(feature_names = names, model = bst, n_first_tree = 2)

(The rendered tree diagram is not reproduced here. xgb.plot.tree draws the first two trees of the model: each internal node shows the splitting feature together with its Cover and Gain statistics, each edge carries the split condition, and the terminal nodes are leaves. In this run the root of the first tree splits on feat_17 (Gain 309.719, Cover 12222.8) and the root of the second tree on feat_14 (Gain 8073.58, Cover 12222.8).)
We are just displaying the first two trees here.

On simple models the first two trees may be enough. Here, it might not be the case. We can see from the size of the trees that the interaction between features is complicated. Besides, XGBoost generates k trees at each round for a k-class classification problem. Therefore the two trees illustrated here are trying to classify data into different classes.
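
As a side note, more recent releases of the XGBoost R package also provide xgb.plot.multi.trees, which projects all the trees of the model onto a single summarized diagram; availability and exact arguments depend on your package version, so treat this call as a sketch:

# Summarize the whole ensemble in one diagram (recent xgboost versions only)
xgb.plot.multi.trees(model = bst, feature_names = names, features_keep = 3)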

