Prediction of Diabetic Patient Readmission Using Machine Learning

Juan Camilo Ramírez
Facultad de Ingeniería de Sistemas
Universidad Antonio Nariño
Bogotá, Colombia
juan.ramirez@uan.edu.co

David Herrera
Facultad de Ingeniería de Sistemas
Universidad Antonio Nariño
Bogotá, Colombia
herrera78@uan.edu.co
Abstract—Hospital readmissions pose additional costs and discomfort for the patient, and their occurrence is indicative of deficient health service quality; hence efforts are generally made by medical professionals to prevent them. These endeavours are especially critical in the case of chronic conditions, such as diabetes. Recent developments in machine learning have been successful at predicting readmissions from the medical history of the diabetic patient. However, these approaches rely on a large number of clinical variables and thereby require deep learning techniques. This article presents the application of simpler machine learning models that achieve superior prediction performance while making computations more tractable.

Index Terms—diabetes, hospital readmission, neural network, random forest, logistic regression

I. INTRODUCTION

Hospital readmissions within 30 days are a health care quality metric, given their associated costs both to the patient and to the clinical institution, and thus are one indicator of inefficiency in the healthcare system [1]–[3]. They are amply studied in a variety of medical conditions; however, they have only recently started to attract the attention of researchers studying healthcare policies for diabetic patients [4]. Different machine learning approaches, including deep learning, have been attempted in order to predict a patient's risk of readmission based on their medical history, with varying results [5]–[9]. The present investigation evaluates several machine learning models aimed at predicting readmission from clinical data recorded in previous visits by the diabetic patient. The techniques used include logistic regression, support vector machines, neural networks and random forests. The models are trained and evaluated over a publicly available dataset comprising patient data from a hospital network in the United States collected over the course of nearly ten years [5]. The performance of all the models tested is evaluated using several metrics, including F1 and ROC AUC (Area Under the Curve in Receiver Operating Characteristic analysis). The random forest model is shown to outperform all the other models under evaluation and to exhibit prediction rates comparable or superior to those of models trained over the same data and previously reported in the literature, including deep learning models, while requiring significantly less computing power.

II. RELATED WORK

Various investigations can be found in the literature seeking a reliable prediction of diabetic patient readmission using a variety of machine learning models and patient data sources. [5], for instance, use multivariable logistic regression to show that there is a decreased risk of 30-day readmission in diabetic inpatients who have their hemoglobin A1c (HbA1c) measured, and that this association holds only for patients whose primary diagnosis is diabetes. This model is trained over a preprocessed dataset derived from electronic health records, which has been made public on the UCI Machine Learning Repository and has subsequently been reused in related studies employing different prediction models and preprocessing methods, all evaluated with different measures. These metrics, derived from the model's number of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN), include accuracy (Equation 1), precision (Equation 2), recall (Equation 3), specificity (Equation 4), F1 (Equation 5) and the area under the ROC curve (ROC AUC score) obtained after plotting the model's recall (Equation 3) against the fall-out (Equation 6).

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \]

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{2} \]

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{3} \]

\[ \text{Specificity} = \frac{TN}{TN + FP} \tag{4} \]

\[ \text{F1} = \frac{2\,TP}{2\,TP + FP + FN} \tag{5} \]

\[ \text{Fall-out} = 1 - \text{Specificity} = \frac{FP}{FP + TN} \tag{6} \]
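For concreteness, the Python 3 sketch below (not part of the original study) shows how these six metrics and the ROC AUC can be computed from a classifier's predictions; the label and score arrays are illustrative placeholders.

```python
# Minimal sketch: computing the metrics of Equations 1-6 from predictions.
# y_true, y_pred and y_score are illustrative placeholders, not data from the study.
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]                 # ground truth (1 = 30-day readmission)
y_pred = [0, 1, 1, 1, 0, 0]                 # hard predictions from some classifier
y_score = [0.2, 0.7, 0.9, 0.8, 0.1, 0.4]    # predicted probabilities for class 1

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # Equation 1
precision = tp / (tp + fp)                   # Equation 2
recall = tp / (tp + fn)                      # Equation 3
specificity = tn / (tn + fp)                 # Equation 4
f1 = 2 * tp / (2 * tp + fp + fn)             # Equation 5
fall_out = 1 - specificity                   # Equation 6

# ROC AUC summarises recall against fall-out across all decision thresholds.
roc_auc = roc_auc_score(y_true, y_score)
```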
Random forest models trained on the dataset compiled by [5] have exhibited good precision-recall scores (0.65) [10], whereas other classifiers fine-tuned through evolutionary algorithms (EA) have yielded good performance in terms of accuracy, recall and specificity (0.97, 1.00 and 0.97, respectively) [11]. Hybrid approaches that can capture more complex patterns between different features have been achieved by combining an Evolutionary Simulated Annealing method with a sparse LOgistic Regression model of Lasso (ESALOR), improving accuracy, precision, recall and F1 (0.76, 0.77, 0.77 and 0.86, respectively) over SVM and other conventional methods [9]. Various classifiers have shown varying performance metrics when patients in this dataset are grouped by age [7]. High ROC AUC (0.95) and F1 (0.92) scores have been achieved with deep learning models, including convolutional networks [8]. The same dataset has been used along with others in order to evaluate novel data mining and deep learning methods [12]–[14]. In other, related studies, risk factors for readmission have also been found by training other machine learning models on different clinical datasets collected in India and the United States [6], [15], [16]. In one of these, motivated by striking a balance between accuracy and interpretability, [16] proposes two novel methods: K-LRT, a likelihood ratio test-based method, and a Joint Clustering and Classification (JCC) method. Results are reported as AUC (0.7924), which fares better than conventional methods but does not surpass random forests (0.8453); however, their method allows for interpretation and the identification of key features.

III. METHODS

The dataset, originally compiled by [5] and publicly available in CSV format, is composed of electronic records spanning ten years (1999 through 2008) with various demographic and clinical variables per patient. The most salient aspects of the dataset can be summarised very briefly as follows: each row corresponds to a hospital visit by a patient, and each patient may have more than one visit, i.e., several rows may be associated with the same patient. Demographic information of the patient is stored as categorical variables, including gender and race as well as age, which appears as labels describing intervals measured in years (e.g., [0, 10), [10, 20), [20, 30), etc.). Columns diag_1, diag_2, and diag_3 contain ICD9 codes indicating the diagnoses made during the visit. Each row also includes 24 features associated with different medications against diabetes, each one indicating whether the drug, or a change in its dosage, was prescribed. The possible values for all these 24 columns are 'NO' (not prescribed), 'Steady' (no change in dosage), 'Up' (increased dosage) and 'Down' (decreased dosage). The class attribute indicates whether the patient was readmitted after the visit, and its possible values are 'NO' (i.e., no readmission), '<30' (i.e., readmission occurred within 30 days) and '>30' (i.e., readmission occurred after 30 days). Full details of the dataset, including detailed descriptions of the features mentioned earlier and others that have been omitted for brevity, can be found in the original study by [5].

Prior to training the models proposed in this paper, this dataset was preprocessed as follows. For each patient only the first visit was retained, i.e., second and subsequent visits from the same patient were removed in order to ensure independence of the data. Eight columns were removed because most of their values were unknown or missing, or because they do not pertain to the medical state of the patient (namely encounter_id, patient_nbr, weight, admission_type_id, discharge_disposition_id, admission_source_id, payer_code and medical_specialty). After this, numerical features were kept intact whereas categorical variables were mapped to numerical representations as follows: class labels 'NO' and '>30' were merged into one, representing 'no 30-day readmission' (encoded numerically as 0), while the third class, '<30', was kept intact (encoded numerically as 1), representing '30-day readmission'. This way the problem is reduced to one of binary classification.
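As an illustration of this first preprocessing stage, the pandas sketch below applies the visit filtering, column removal and class binarisation just described. The file name diabetic_data.csv, the class column name readmitted, and the use of the encounter_id order to identify a patient's first visit are assumptions based on the public UCI release rather than details stated in the paper.

```python
# Sketch of the first preprocessing stage described above (assumed column names).
import pandas as pd

df = pd.read_csv("diabetic_data.csv")  # assumed file name of the UCI release

# Keep only each patient's first visit so that rows are independent
# (encounter_id is assumed to reflect visit order).
df = df.sort_values("encounter_id").drop_duplicates(subset="patient_nbr", keep="first")

# Drop the eight columns that are mostly missing or unrelated to the medical state.
dropped = ["encounter_id", "patient_nbr", "weight", "admission_type_id",
           "discharge_disposition_id", "admission_source_id",
           "payer_code", "medical_specialty"]
df = df.drop(columns=dropped)

# Binarise the class: '<30' -> 1 (30-day readmission), 'NO' and '>30' -> 0.
df["readmitted"] = (df["readmitted"] == "<30").astype(int)
```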
The remaining categorical columns were transformed as follows. ICD9 codes representing the diagnoses of the patient in each visit were grouped by their ICD9 chapter, thus reducing the number of possible values in columns diag_1, diag_2, and diag_3 to only 19. Thereafter, these three categorical attributes were transformed into numerical ones by replacing them with dummy variables. That is to say, diag_1 was replaced with new, binary columns, one for each ICD9 chapter code, in such a way that a value of 1 in one of these new columns indicates that a diagnosis of a disease or condition from this ICD9 category was made and recorded in the original diag_1, whereas a value of 0 indicates otherwise. For instance, a value of 1 in dummy variable diag_1_ICD9_9 would indicate that the patient's first diagnosis pertains to a condition in ICD9 Chapter 9, namely a disease of the digestive system, and a value of 0 would indicate otherwise. Variables diag_2 and diag_3 were replaced with dummy counterparts in the same manner. An analogous procedure was followed in order to replace other non-ordinal, categorical columns, such as race and gender, with dummy variables. Ordinal, categorical values in column age were replaced directly with numerical values, with higher values reflecting higher age groups (e.g., [0, 10) was encoded as 0, [10, 20) was encoded as 1, etc.). Finally, all 24 columns referring to medication prescriptions were converted to binary features by merging the values 'Steady', 'Up' and 'Down' into one single value representing 'Drug prescribed', while the value 'NO' was kept intact to represent 'Drug not prescribed'. Subsequently, each one of these 24 medication prescription features was replaced with a binary dummy variable, in a manner analogous to that described earlier for the diagnosis attributes. After all these feature transformations the resulting dataset comprises 100 columns, including the class column.
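Continuing the sketch above, a possible implementation of these transformations is shown below. The icd9_chapter helper and the exact age and medication label strings are illustrative assumptions; the numeric chapter boundaries follow the standard ICD9 ranges, with E and V codes kept as two supplementary groups to match the 19 values mentioned above (plus a small extra bucket for codes recorded as missing).

```python
# Sketch of the diagnosis, demographic, age and medication transformations.
import pandas as pd

# Upper bounds of the 17 ICD9 chapters (001-139, 140-239, ..., 800-999).
CHAPTER_UPPER_BOUNDS = [139, 239, 279, 289, 319, 389, 459, 519, 579,
                        629, 679, 709, 739, 759, 779, 799, 999]

def icd9_chapter(code):
    """Map a raw ICD9 code to one of 19 groups (17 chapters plus E and V codes)."""
    code = str(code)
    if code.startswith("E"):
        return "E"
    if code.startswith("V"):
        return "V"
    try:
        value = int(float(code))
    except ValueError:          # missing codes (recorded as '?') get their own bucket
        return "missing"
    for chapter, upper in enumerate(CHAPTER_UPPER_BOUNDS, start=1):
        if value <= upper:
            return "ch{}".format(chapter)
    return "missing"

# Group the three diagnosis columns by ICD9 chapter, then expand them, together
# with the other non-ordinal categorical columns, into binary dummy variables.
for col in ["diag_1", "diag_2", "diag_3"]:
    df[col] = df[col].apply(icd9_chapter)
df = pd.get_dummies(df, columns=["diag_1", "diag_2", "diag_3", "race", "gender"])

# Encode the ordinal age intervals as increasing integers: [0,10) -> 0, [10,20) -> 1, ...
age_order = ["[0-10)", "[10-20)", "[20-30)", "[30-40)", "[40-50)",
             "[50-60)", "[60-70)", "[70-80)", "[80-90)", "[90-100)"]
df["age"] = df["age"].map({label: i for i, label in enumerate(age_order)})

# Binarise the medication columns: any prescription ('Steady', 'Up', 'Down') -> 1,
# no prescription -> 0. Only two of the 24 drug columns are listed here for brevity.
for col in ["metformin", "insulin"]:
    df[col] = (df[col].str.upper() != "NO").astype(int)
```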
In order to reduce the dimensionality of the data prior to training the prediction models, principal component analysis was conducted to reduce the number of features from 100 to 45 while preserving 98% of the variance. Thereafter, all features were normalised to a common scale, with unit variance and zero mean. After this, prediction models with logistic regression (LR), single layer perceptron (SLP), multilayer perceptron (MLP) and random forests (RF) were individually trained on the selected features using 10-fold cross-validation. Before doing this, the training data were balanced through oversampling, since the original dataset was found to be highly unbalanced, with 63,417 'no 30-day readmission' visits against only 6,152 '30-day readmission' visits. Several performance metrics were calculated for each trained model, including ROC AUC and F1. Overfitting was prevented by the use of cross-validation during training and evaluation, as well as by the application of oversampling only on the training data and not the testing data.
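A sketch of this training and evaluation protocol is given below, continuing from the preprocessed df of the earlier sketches. The paper does not name an oversampling implementation; imbalanced-learn's RandomOverSampler and pipeline are used here as one reasonable choice, since an imbalanced-learn pipeline applies resampling only while fitting on the training folds. A recent scikit-learn/imbalanced-learn setup is assumed rather than the Python 2.7 / scikit-learn 0.20 environment reported in the paper.

```python
# Sketch of dimensionality reduction, scaling, oversampling and 10-fold CV.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline  # applies the sampler only when fitting

# Assumes all remaining columns have already been converted to numeric as above.
X = df.drop(columns=["readmitted"])   # assumed class column name
y = df["readmitted"]

pipeline = Pipeline([
    ("pca", PCA(n_components=45)),            # 100 -> 45 features (about 98% variance)
    ("scale", StandardScaler()),              # zero mean, unit variance
    ("oversample", RandomOverSampler(random_state=0)),
    ("clf", RandomForestClassifier(n_estimators=100)),
])

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(pipeline, X, y, cv=cv, scoring=["roc_auc", "f1"])
print(scores["test_roc_auc"].mean(), scores["test_f1"].mean())
```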
All the models were implemented in Python (2.7.15) using the library scikit-learn (0.20). The LR model was trained using the stochastic average gradient (SAG) solver implementation provided by scikit-learn. The rectified linear unit (ReLU) function was used as the activation function in the SLP and MLP models, with the latter having one 23-neuron hidden layer, whose size was chosen as a middle point between that of the input (45) and the output (1) layers. The RF model was trained with 100 trees in order to improve the estimates from the out-of-bag predictions without increasing the computational cost of training. During experimentation the models were observed to exhibit the best predictive and computational performance with these parameters, and the results are presented in the following section of this article. Nevertheless, other parameter choices were also considered for each model during experimentation, e.g., a larger hidden layer for the MLP and a logistic activation function for the SLP, among others. These were omitted from this article for brevity.
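The following sketch shows how these configurations map onto scikit-learn estimators. The single layer perceptron has no direct scikit-learn counterpart with a ReLU activation and is therefore omitted here, and max_iter is an illustrative addition rather than a value taken from the paper.

```python
# Sketch of the model configurations described above, as scikit-learn estimators.
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier

models = {
    # Logistic regression trained with the stochastic average gradient (SAG) solver.
    "LR": LogisticRegression(solver="sag", max_iter=1000),
    # Multilayer perceptron: one 23-neuron hidden layer (between the 45 inputs and
    # the single output) with ReLU activations.
    "MLP": MLPClassifier(hidden_layer_sizes=(23,), activation="relu"),
    # Random forest with 100 trees; out-of-bag estimates enabled.
    "RF": RandomForestClassifier(n_estimators=100, oob_score=True),
}
```

Any one of these estimators can take the place of the "clf" step in the cross-validation pipeline sketched earlier.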
IV. RESULTS

The performance metrics obtained with each one of the models trained are listed in Table I. These metrics show that the worst-performing model was the single layer perceptron, whereas the best prediction scores were achieved, by far, by the random forest model. These even exceed those reportedly obtained through deep learning techniques, listed in Section II, while requiring significantly less computing power. Notably, Table I also shows that none of the other models evaluated managed to obtain performance scores near those achieved by the random forest model, instead obtaining rather modest scores, mostly just over 0.5.

V. CONCLUSIONS

This research article presents a machine learning approach for the identification of diabetic patients at risk of requiring hospital readmission. A reduction of these statistics should be expected to contribute towards the improvement of patients' well-being as well as towards a reduction in financial and reputational costs to healthcare institutions. This is particularly critical for patients with chronic conditions, such as diabetes. This creates the need for policies and strategies to reduce these statistics, especially methods to predict when a patient is at high risk of requiring readmission in the future. The best prediction rates reported in the recent literature have been achieved only through the use of deep learning on the patient clinical data collected by [5]. However, the present research article describes a novel method for efficiently managing the same dataset for the training of machine learning models, allowing better prediction rates without requiring deep learning. The best-performing model trained in this study, namely the random forest, exceeds the prediction metrics reported by others in the recent literature using the same base dataset. This includes the precision-recall scores (0.65) reported by [10] as well as the ROC AUC (0.95) and F1 (0.92) scores reported by [8] through the use of convolutional networks, while also exceeding or closely approximating the accuracy, recall and specificity (0.97, 1.00 and 0.97, respectively) reported by [11]. To the best knowledge of the authors of the present study, no prediction models reported in the literature achieve the same performance metrics obtained by the RF model presented in this paper.

The main contributing factor to this result is the preprocessing of the patient data, which is achieved through a method not previously explored in the related literature using the same dataset. This method contemplates the simplification of the data in various ways, one being the reduction of several variables' domains with the intent of allowing the prediction models to generalise more easily. This includes the grouping of patients' diagnosis codes by their corresponding ICD9 chapters, thus reducing the domain of these variables to only 19 possible values. This simplification was complemented by the application of a similar strategy to some of the other dataset features, including the class attribute, whose domain was adjusted from three to two possible values, thus transforming the original problem, i.e., the prediction of a future readmission, into one of binary classification. This simplification of the problem, paired with the dimensionality reduction achieved through principal component analysis as well as the class balancing achieved through oversampling, results in a much more compact dataset from which generalisations can be learned by the prediction models presented, without requiring deep learning techniques and while achieving higher performance metrics than those reported previously in related literature using the same original dataset.

With the main exception of the domain reduction of some features described earlier, the data preprocessing procedure used in the present study and the methodology described in previous, related studies share some aspects, such as consideration of only first patient visits, feature selection and data balancing [8], [11]. As stated before, this simplification of the data is arguably a contributing factor to the predictive power of the models proposed. Nevertheless, interpretability of the prediction is still limited despite this simplification of the dataset, given the complexity of the random forest technique. Furthermore, the proposed data preprocessing clearly comes at the cost of reduced granularity. It can be reasonably hypothesised, though evidently not guaranteed, that these results could be further improved with larger datasets, i.e., with higher numbers of patient visits with the same features. This investigation is focused on diabetes, however, this type of
TABLE I
Performance metrics of the models: logistic regression (LR), single layer perceptron (SLP), multilayer perceptron (MLP) and random forests (RF).

Model   ROC AUC   F1       Precision   Recall   Accuracy
LR      0.5783    0.5550   0.5599      0.5515   0.5566
SLP     0.5229    0.5484   0.5129      0.5929   0.5147
MLP     0.6548    0.6164   0.6100      0.6095   0.6083
RF      0.9999    0.9974   0.9950      0.9999   0.9974