Predictive Analytics
All content following this page was uploaded by Martin Sarnovsky on 30 April 2018.
ABSTRACT
The work presented in this paper focuses on creating predictive models that help in the process of incident resolution and
implementation of IT infrastructure changes, in order to increase the overall support of IT management. Our main objective was to
build the predictive models using machine learning algorithms and the CRISP-DM methodology. We used the database of incidents and
related changes obtained from the IT environment of the Rabobank Group, which contained information about the processing of
incidents during the incident management process. We decided to investigate the dependency between the observation of an incident
on a particular infrastructure component and the actual source of the incident, as well as the dependency between the incidents and
related changes in the infrastructure. We used Random Forest and Gradient Boosting Machine classifiers to identify the incident
source and to predict the possible impact of the observed incident. Both types of models were tested on a testing set and evaluated
using defined metrics.
1. IT SERVICE MANAGEMENT AND INCIDENT MANAGEMENT PROCESS

To define service management precisely, we first need to specify what a service is and, more concretely, what an IT service is. According to [1], a service is a means of delivering value to customers by facilitating outcomes customers want to achieve without the ownership of specific risks and costs. This definition of service is rather general. Speaking of IT services, we will consider services that in some way make use of ICT technologies. An IT service can then be considered as one or more IT systems and mechanisms which enable the business processes of the organization. To ensure that IT services satisfy the customer’s needs and that the corresponding ICT technologies are used effectively, they must be put under specialized management processes. This discipline is called IT Service Management (ITSM) [2], and it is defined as a set of specialized organizational capabilities for providing value to customers in the form of services. The main goal of ITSM is to ensure delivery of quality IT services that support the business objectives of the organization using cost-effective resources.

Over time, ITSM evolved into highly standardized frameworks based on best practices. Best practices evolved into industry standards for the management of ICT (ISO/IEC 20000) [3] and also into public domain frameworks such as ITIL or COBIT [4]. ITIL (IT Infrastructure Library) is nowadays a de facto standard for implementing ITSM in businesses. It provides a comprehensive set of best practices for ITSM. It is based on the experience and mistakes made in the UK and Europe during the implementation of IT projects and provides a collection of the best practices observed in the IT service industry. Because ITIL included practices that really worked, it started to be adopted outside the British government sector for which it was originally intended, and around the turn of the century ITIL was considered the internationally accepted standard for managing information technology services.

Currently, ITIL consists of five parts, each corresponding to a particular phase in the IT service life cycle. Service Strategy [5] provides a practical framework to design, develop and implement service management not only from an organizational point of view but also as a source of strategic advantage. The strategy of the service provider must be based on the fact that the customer does not buy products, but tries to satisfy specific needs. The provider must understand the broader context of the current and potential markets where it operates or intends to provide such services. The Service Design [6] phase aims to design the services to meet agreed outcomes. A service is designed including its components and complemented with additional data such as functional and operational requirements, acceptance criteria and plans for the deployment of services in operation. Service Transition [7] describes the life-cycle phase of transition of the service into the live environment. It combines procedures including Release Management, Program Management and Risk Management. In addition, the publication describes the processes associated with change management. An equally important part of this phase is the concept of the Configuration Management Database (CMDB), which is a database that documents the attributes of each component of the IT infrastructure (known as a Configuration Item, CI) and provides a model of the components and their inter-relationships and dependencies. Service Operation [8] provides procedures for managing live services in a production environment, achieving efficiency and effectiveness in service delivery and supporting the services so that the produced value benefits the customer as well as the service provider. The processes described in the publication serve for monitoring, maintenance, and service improvement. This includes managing incidents and service requests, problem management, and operations management. Continual Service Improvement [9] contains the means for creating and maintaining value-
ISSN 1335-8243 (print) © 2018 FEI TUKE ISSN 1338-3957 (online), www.aei.tuke.sk
58 Predictive Models for Support of Incident Management Process in IT Service Management
added customer service by increasing service quality and the efficiency of operations. It combines principles, practices, and methods of quality management, change management and capacity improvement, while working to improve each stage of the lifecycle as well as the current services and processes.

The work presented in this paper mostly deals with the Service Operation phase and the handling of incidents. An incident in this context can be described as an event that leads to a service interruption or causes a decrease in the service quality level. Incident management is a process that specifies how to handle incidents in a unified way. The main objective of the process is to restore the service as soon as possible. It specifies the steps that need to be performed within the process, such as prioritization and categorization, and gives recommendations on how they are done. The process describes which information has to be recorded to provide an accurate representation of the incident, and also the steps that need to be performed before the actual solution. Two different types of escalation are also introduced here. Functional and hierarchical escalations differ in how the escalation itself is performed. A functional escalation passes the incident directly to a specialized group (designed to solve incidents of this type), while a hierarchical escalation passes the incident to a higher level in the hierarchical structure of the IT department or organization. The process then specifies the steps needed to close and review the incident.

2. INCIDENT MANAGEMENT DATA ANALYSIS

Our main objective in this work was to perform data analysis on top of the ITSM incident management data. We explored two different tasks. The first one was to explore the dependency between the CIs which were primarily assigned to the incident by a Service Desk and the CIs which were actually responsible for the service breakdown (and therefore were the primary source of the incident). Often, the CIs reported with the incidents are the CIs where the incident is observed, but they are not directly responsible for the service breakdown, as the incident could be triggered elsewhere (on another CI). The second task focused on the exploration of the dependency between incident and change management. Often, incidents (after their investigation) can lead to changes in the infrastructure, e.g. replacement of a CI with a newer one. For incident managers, the information whether an incident can lead to a change could be interesting. Our goal in this task is to build a model which will be able to predict the need for a change for a particular incident.

We used the CRISP-DM (Cross-Industry Standard Process for Data Mining) [10] methodology, which is nowadays a standard for solving data analytical tasks. CRISP-DM consists of six major phases. Business/problem understanding focuses on understanding the project objectives and requirements and converting the problem into a data mining problem definition. Data understanding covers collecting the data, getting familiar with the data, identifying data problems and gaining first insights. The data preparation phase covers the activities needed to obtain the final dataset from the raw data. It usually includes multiple methods of data transformation, attribute selection, and cleaning. The modeling phase involves the application of the modeling techniques and the calibration of their parameters to optimal values. The evaluation phase examines the constructed models and matches the results to the objectives set during the initial phases. The deployment phase represents the implementation of the models in production. The following sub-sections represent the particular phases of the methodology applied to our problem.

2.1. Problem understanding

Incident management is a process whose main objective is to restore the operation of the IT service affected by a corrupted CI as fast as possible. This process is often implemented in a non-ideal fashion, and several activities performed by human operators can cause delays. Therefore, there is a need for tools that assist in particular process activities in order to make process execution more fluent and effective and, in certain situations, to enable the automation of particular process segments. The main idea is to leverage the existing data about the incidents, their processing, and related changes, and to use the knowledge extracted from these records to build predictive models designed to assist the operators during the incident management process. As mentioned above, we decided to focus on two selected tasks. From the data analysis perspective, we will build predictive models which could be used during the Incident Management process to assist the operators and other people involved in the process with certain activities. The first model will predict whether the CI associated with the incident is actually the one responsible for the incident occurrence. We will use a classification model, trained on the database of historical incidents, to predict whether the reported CI triggered the incident. The second model will investigate the dependency between the incidents and the changes in the infrastructure triggered by those incidents. In this case, too, we will use classification methods trained on the historical data; the target attribute describes whether the incident will result in a change or not. Both models will be tested and evaluated using pre-defined criteria; we will focus on a selected set of metrics. First, we will measure the classifier precision and error rate. A more detailed investigation of the model results will use the confusion matrix and the ROC (Receiver Operating Characteristic) and AUC (Area Under the Curve) [11] metrics. Among combined metrics, we also used the F1 metric, which combines both precision and recall.

2.2. Data understanding and data preparation

We used the data provided by the ICT division of the Rabobank Group (a Dutch bank) [12]. The dataset consisted of several files containing specific records. Change records contained information extracted from the Service Management tool from the process of Change Management and implementation of the changes. Incident records described the processing of the incidents. Interaction records also contained related records as well as the resolution description with knowledge management related fields. The last one was the Incident activity records dataset, which tracked specific activities related to the
Acta Electrotechnica et Informatica, Vol. 18, No. 1, 2018 59
solution of the particular incident. For our purposes, we worked mostly with the Incident and Change records datasets. Both contained detailed descriptions of the incidents and changes that occurred, the associated Configuration Items, times of opening and closure, related incidents, related changes, etc. The dataset was used in several studies, mostly related to process mining [13][14] and prediction of the impact of the changes [15]. In [16], the authors used predictive models (based on trees, SVM and ensemble models) to predict the duration of a change and its overall impact. The overall goal was to predict the Service Desk workload based on interactions with the affected CI. Statistical methods were used in [17] to analyse incident ticket attributes in order to identify trends and unusual patterns in operation. In general, research in this area often aims at the automation of certain activities within the Service Operation processes to make the Service Desk more effective [18]. In [19], a decision-making model is introduced which, using a knowledge base, is able to automate the overall process and improve the efficiency of the provided incident responses. On the other hand, incident relations can also be investigated in order to find re-occurring or co-occurring incidents [20]. In some cases, certain predictive tools are integrated into frequently used ITSM tools, e.g. SAP HANA supports real-time predictions using SAP Predictive Analytics (https://www.sap.com/products/analytics/predictive-analytics.html), and ServiceNow (https://www.servicenow.com/) can be extended with the Predict Incidents module with such capabilities. Our task was similar to the research performed in the area of investigation of incident relations. We focused on investigating whether the reported CI was actually the CI that generated the incident, and on the relation between the incident and the resulting changes. The following paragraphs introduce the main attributes of the raw incident and change records data as present in the dataset.

Attributes Description – Incident records
CI name (Aff) – CI where a disruption of the service was noticed
CI Type (Aff) – type of the CI
CI Subtype (Aff) – sub-type of the CI
Service comp WBS (Aff) – every CI in the CMDB is connected to one Service Component to identify who is responsible for the CI
Incident ID – unique ID of the incident
Status – status of the incident
Impact – impact of the service downtime on the customer
Urgency – how urgently the incident has to be solved
Priority – combines Impact and Urgency
Category – used to categorize the incidents into groups according to their similarity
KM number – Knowledge Document number; refers to the Knowledge Base
Open time – the time of the record opening in the Service management tool
Reopen time – time of re-opening, if the incident was closed and re-opened
Resolved time – date and time when the incident was resolved
Closed time – date and time when the record was closed
Handle time – time needed to resolve the incident
Closure code – code that describes the type of service disruption
Alert status – alert status based on the SLA (whether it was or was not breached)
#Reassignments – number of reassignments of the incident during its resolution
#Related Interactions – number of related interactions
Related Interactions – list of related interactions
#Related Incidents – number of related incidents
#Related Changes – number of related changes
Related Change – if a Change is related, it is recorded here (multivalue field if more Changes are related)
CI Name (CBy) – CI which caused the disruption of the service
CI Type (CBy) – CI type
CI Subtype (CBy) – CI sub-type

Attributes Description – Change records
CI name (Aff) – CIs affected by the Change
CI Type (Aff) – CI type
CI Subtype (Aff) – CI sub-type
Change ID – Change identifier
Change Type – Change category
Risk Assessment – specifies the impact on business
Emergency Change – indicates if a Change is an Emergency one
CAB-approval needed – indicates if Change Advisory Board approval is needed
Planned Start – date and time of Change implementation start
Planned End – date and time of Change implementation finish
Scheduled Downtime Start – date and time of scheduled downtime during the Change implementation
Scheduled Downtime End – date and time of scheduled restore after the Change implementation
Actual Start – actual date and time of Change implementation
Actual End – actual date and time of service restore
Requested End Date – date and time of requested service restore after Change implementation
Change record Open Time – date and time of Change record initiation
Change record Close Time – date and time of Change record closure
Originated from – specifies the origin of the Change request
#Related Interactions – number of interactions during the Change implementation
#Related Incidents – number of incidents related to the Change

The very first step of the data pre-processing was the identification and removal of missing values. Nine of the Incident dataset attributes contained missing values. After the data inspection, we removed several records with missing values and selected a missing-value placeholder, which marked the occurrence of a missing value. When applicable, we replaced the missing value with 0 (when the missing value meant that the event did not occur, for example in the case of Reassignments); in several numeric attributes (when it made sense, e.g. the number of related interactions) we replaced it with the mean value. The next step was to filter the records in both datasets, as the incident dataset also contained records representing service requests, informative
records etc. The main idea was to keep only the incident records in the incident dataset and the changes in the change dataset. The Open.time attribute was transformed, and new attributes were created. Those attributes specified the month, day of the week and hour of the incident opening. After the data cleaning and pre-processing, we integrated the data into a common consistent table.

Then we had to define and create the target attributes for both models. The target variables, in this case, were not specified explicitly in the dataset but could be derived from certain attributes in the tables. For the first predictive task, we created an attribute CI.Name.equality, which specifies whether the CI on which the incident was observed was really responsible for the incident occurrence or not. We compared the CI.Name.aff and CI.Name.CBy attribute values; if those values were equal, CI.Name.equality was set to 1, and if they were different, we set the newly created attribute to 0. We used a similar approach to create the target attribute for the second predictive task. In this case, we created an attribute Change.ID.equality. Its value was derived from the attribute values of Change.ID and Related.Change. We also explored the distribution of the target attribute values in the dataset and decided to use one of the techniques for the imbalanced class problem. Those will be described in the modeling and evaluation sections. Then we computed the descriptive characteristics of the dataset attributes and their respective correlations, and applied feature extraction methods. We decided to remove several attributes that did not have a significant impact on classification and obtained a final set of predictors (e.g. we used only the Priority attribute and left out the Impact and Urgency attributes, as the Priority value is directly computed from both of them). Among the most significant attributes in both tasks were CI_Type, CI_Subtype and Service comp, as well as the attributes derived from Open.time.

2.3. Modeling

During this phase, we focused on training the predictive models. We used the R environment, and as the machine learning tool we selected the H2O framework. H2O (http://www.h2o.ai/) is open source software for data analysis and machine learning. It provides an API for Java, Python and the R language [21]. It also enables developers to create an H2O cluster on top of big data analysis platforms and infrastructures and to access the implemented distributed machine learning models from the R environment. The H2O package contains implementations of the currently most popular machine learning algorithms, such as Generalized Linear Models (GLM), Random Forests, Gradient Boosting (GBM), K-Means, Deep Learning and many others, including utilities and tools for data access, preprocessing etc.

For model training, we split the dataset into training, validation and testing sets of different sizes. The training set was used to build the predictive models, the validation set was used to optimize the model parameters, and the completely independent testing set was used for evaluation purposes. We also did several experiments using cross-validation, to check whether it brings any benefit when used instead of dataset splitting. Then, we used several approaches to balance the distribution of the target attribute. We built predictive models based on the Random Forest and GBM algorithms in both tasks. Those models were selected after preliminary experiments, which showed that they were (in terms of precision and recall) more suitable for the data. Therefore, we continued with training and optimization of these models using the validation set. In the first task, we experimented with different parameters of the Random Forest and GBM models. The best results on the validation set were achieved with the following settings of the Random Forest model:

    ntrees = 200,
    stopping_rounds = 3,
    score_each_iteration = TRUE

where the ntrees parameter specifies the number of trees built within the forest, and the stopping_rounds parameter, which is not enabled by default, is used for early stopping to prevent overfitting. The stopping metric was set to AUC and the stopping tolerance parameter to 0.0005. The stopping parameters specify that the model learning will stop after three scoring intervals in which the AUC has not increased by more than 0.0005. Because we used the validation set, the stopping criterion was evaluated on the validation AUC, not on the training set itself. For the GBM model, we used the following parameter values:

    learn_rate = 0.3
    stopping_tolerance = 0.01

In this case, the additional learn_rate parameter controls the learning rate of the model. Smaller values of the parameter cause the model to learn more slowly, requiring more trees to reach the same overall error rate, but they typically result in a better, more general model, especially on the testing data. Therefore, we experimented on the validation set with multiple learning rate values and obtained the best results when lowering the value of the learning rate to 0.001. The stopping tolerance in this model was set to 0.001.

For the second task, we used the same approach and selected the same parameter values for both models.

2.4. Evaluation of the models

This section is dedicated to the evaluation of the models in both tasks. We used several approaches to measure the model accuracy on the testing set. As the main metric, we used the area under the Receiver Operating Characteristic curve (ROC AUC), which is commonly used to present results for binary decision problems in machine learning. Table 1 summarizes the results of the models with the different sampling methods and train/validation/test split sizes used for the first task. The best results were achieved by the Random Forest model trained on the 70/10/20 split. Its average error rate was 13.1%, split between both values of the predicted class.

The confusion matrix showing the classification into the particular classes and the classification errors is shown in Table 2. The F1 metric (which combines precision and recall) of the model was 0.9247. Class 0, representing that the incident was not caused by the reported CI, was the class with a relatively high error rate. On the other side,
more important in this task is to confirm whether the incident was caused by the reported CI. Classification of this class was more precise and, from the task perspective, the error rate on this class matters more than the error rate on the other one. Because we focused mostly on the prediction of class 1, the best models could tolerate a relatively higher error rate on class 0.

2.5. Deployment

Deployment of the models into the production environment represents the final stage of the CRISP-DM methodology. In this case, we demonstrated the possibility of model deployment and integration by implementing a web user interface which simulates the user interface of the service management tools that are usually used in businesses for ITSM purposes. The application serves as a web-based interface to the data and models. It enables the model scoring functionality: recording of the incident data (data reported to the service desk, data recorded when an incident occurs) and performing prediction (with both models) on that data. The output of the models may serve as a kind of recommendation for an operator working within the Incident Management process with such an application. Other implemented functionalities include several visualizations of the incident data. Such visualizations can provide the operator with better insight into the incident data and enable them to build a more complete picture of the incidents and related changes. Fig. 1 depicts the user interface of the implemented application. The application was implemented using R Shiny.

Table 1 Results of the models in the first task

Model           Sampling        Train/Valid/Test        AUC
Random Forest   Both            60/20/20                83.76
Random Forest   Both            70/10/20                85.52
Random Forest   Both            80/10/10                85.06
Random Forest   Both            5x Cross-validation     85.10
Random Forest   Both            10x Cross-validation    85.36
GBM             Class weights   60/20/20                83.40
GBM             Class weights   70/10/20                84.81
GBM             Class weights   80/10/10                85.27
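
Table 1 reports the AUC of each model, apparently scaled to the 0-100 range. As a reminder of what this metric measures, the following minimal Python sketch computes the AUC as the fraction of (positive, negative) pairs that a classifier ranks correctly, counting ties as one half. The labels and scores are hypothetical and are not taken from the Rabobank dataset; the function name roc_auc is likewise illustrative and is not part of H2O.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive example is scored
    higher than a random negative one (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical test-set labels (1 = reported CI caused the incident)
# and hypothetical predicted probabilities of class 1.
labels = [1, 1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.7, 0.2]

auc = roc_auc(labels, scores)
print(f"AUC = {100 * auc:.2f}")  # scaled to 0-100, as Table 1 appears to be
```

The pairwise loop is quadratic in the number of examples, so it is only meant for checking small examples by hand; a rank-based computation is preferable for test sets of realistic size.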
Table 2 Confusion matrix (rows: actual class, columns: predicted class)

Actual \ Predicted   0      1       Error
0                    500    781     0.6098
1                    212    6098    0.0335
Totals               712    6879    0.1308
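
The overall error rate and the F1 score quoted for Table 2 can be cross-checked directly from the four counts in the table. The short Python sketch below does this arithmetic, assuming that class 1 (the reported CI caused the incident) is treated as the positive class; the variable names are illustrative.

```python
# Counts copied from Table 2 (rows: actual class, columns: predicted class).
tn, fp = 500, 781    # actual class 0: predicted 0, predicted 1
fn, tp = 212, 6098   # actual class 1: predicted 0, predicted 1

total = tn + fp + fn + tp

# Per-class and overall error rates (the "Error" column of Table 2).
error_class0 = fp / (tn + fp)
error_class1 = fn / (fn + tp)
overall_error = (fp + fn) / total

# Precision, recall and F1 with class 1 as the positive class (assumption).
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"error class 0 = {error_class0:.4f}")
print(f"error class 1 = {error_class1:.4f}")
print(f"overall error = {overall_error:.4f}")
print(f"precision     = {precision:.4f}")
print(f"recall        = {recall:.4f}")
print(f"F1            = {f1:.4f}")
```

With these counts the overall error rate rounds to 0.1308 and the F1 score to 0.9247, matching the values reported in the text; the recomputed per-class error rates differ from those printed in Table 2 only in the fourth decimal place.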
This work was supported by the Slovak Research and Development Agency under the contract No. APVV-16-0213 and by the VEGA project under grant No. 1/0493/16.

REFERENCES

[1] YOUNG, C. M.: ITSM Fundamentals: How to Create an IT Service Portfolio, Gartner research note 1–6, 2011.
[2] SARNOVSKY, M. – FURDIK, K.: IT service management supported by semantic technologies, In: SACI 2011 - 6th IEEE International Symposium on Applied Computational Intelligence and Informatics, Proceedings, 2011.
[3] DISTERER, G.: ISO 20000 for IT, Business & Information Systems Engineering, 1, pp. 463–467, 2009.
[4] ISACA: COBIT 5 Framework, 2012.
[5] CANNON, D.: ITIL Service Strategy 2011 edition, 2011.
[6] HUNNEBECK, L.: ITIL Service Design, 2011.
[7] CANNON, D.: ITIL Service Transition, 2011.
[8] CANNON, D. – WHEELDON, D.: ITIL Service Operation, 2007.
[9] Great Britain Cabinet Office: ITIL Continual Service Improvement, The Stationery Office, 5, 2011.
[10] SHEARER, C.: The CRISP-DM model: The New Blueprint for Data Mining, Journal of Data Warehousing, Vol. 5, pp. 13–22, 2000.
[11] HERNANDEZ-ORALLO, J.: ROC curves for regression, Pattern Recognition, Vol. 46, pp. 3395–3411, 2013.
[12] Van DONGEN, B. F.: BPI Challenge 2014. http://data.4tu.nl/repository/uuid:c3e5d162-0cfd-4bb0-bd82-af5268819c35
[13] THALER, T. – KNOCH, S. – KRIVOGRAD, N. – FETTKE, P. – LOOS, P.: ITIL Process and Impact Analysis at Rabobank ICT, Proceedings of the 4th Business Process Intelligence Challenge, Business Process Intelligence Challenge (BPIC-14), located at Business Process Management (BPM2014), September 7-11, Eindhoven, Netherlands, 2014.
[14] ARIAS, M. – ARRIAGADA, M. – ROJAS, E. – SAINT-PIERRE, C. – SEPÚLVEDA, M.: Rabobank: Incident and change process analysis, Proceedings of the 4th Business Process Intelligence Challenge, Business Process Intelligence Challenge (BPIC-14), located at Business Process Management (BPM2014), September 7-11, Eindhoven,
[15] BUFFETT, S. – EMOND, B. – GOUTTE, C.: Using Sequence Classification to Label Behavior from Sequential Event Logs, In: 2014 Business Process Intelligence (BPI) Challenge, p. 27, 2014.
[16] DEES, M. – Van Den END, F.: A Predictive Model for the Impact of Changes on the Workload of Rabobank Group ICT’s Service Desk and IT Operations, BPI Challenge 2014.
[17] LI, T. H. – LIU, R. – SUKAVIRIYA, N. – LI, Y. – YANG, J. – SANDIN, M. – LEE, J.: Incident Ticket Analytics for IT Application Management Services, In: 2014 IEEE International Conference on Services Computing, pp. 568–574, IEEE, 2014.
[18] ANDREWS, A. A. – BEAVER, P. – LUCENTE, J.: Towards better help desk planning: Predicting incidents and required effort, Journal of Systems and Software, Vol. 117, pp. 426–449, 2016.
[19] YUN, M. – LAN, Y. – HAN, T.: Automate incident management by decision-making model, In: 2017 IEEE 2nd International Conference on Big Data Analysis (ICBDA), IEEE, pp. 217–222, 2017.
[20] LIU, R. – LEE, J.: IT Incident Management by Analyzing Incident Relations, Presented at the November 12, 2012.
[21] AIELLO, S. – KRALJEVIC, T. – MAJ, P.: Package “h2o”, CRAN, 2016.

Received October 4, 2017, accepted February 19, 2018

BIOGRAPHIES

Martin Sarnovsky has worked as an assistant professor at the Department of Cybernetics and Artificial Intelligence at the Faculty of Electrotechnics and Informatics of the Technical University of Košice since 2010. He graduated (MSc degree) with distinction at the Department of Cybernetics and Artificial Intelligence in the study programme Artificial Intelligence. He defended his PhD thesis in the area of distributed classification of textual documents, entitled “Knowledge discovery in text documents using the grid computing”, in 2009. His scientific research is mostly focused on data and text analysis, in particular on big data aspects such as stream processing. His other fields of professional interest include semantic modelling, ontologies, and IT service management.

Juraj Surma is currently a student at the Department of Artificial Intelligence at the Faculty of Electrotechnics and Informatics of the Technical University of Košice in the study programme Business Informatics. In 2017, he received his bachelor degree in Business Informatics. He focuses mostly on the area of big data analysis and related technologies.