Report Diabetics
ARTICLE INFO

Keywords:
Diabetes
Ensemble methods
Feature selection
Logistic regression
Prediction model

ABSTRACT

Background: Logistic regression is a classification model in machine learning, extensively used in clinical analysis. It uses probabilistic estimations which help in understanding the relationship between the dependent variable and one or more independent variables. Diabetes is one of the most common diseases around the world; when it is detected early, the progression of the disease may be prevented and other complications avoided. In this work, we design a prediction model that predicts whether a patient has diabetes, based on certain diagnostic measurements included in the dataset, and explore various techniques to boost the performance and accuracy.
Methods: Logistic Regression is the main algorithm used in this paper, and the analysis is carried out using a Python IDE. The experiment mainly uses two datasets: the PIMA Indians Diabetes dataset, which is originally from the National Institute of Diabetes and Digestive and Kidney Diseases, and a dataset from Vanderbilt, which is based on a study of rural African Americans in Virginia. Feature selection is carried out using two different methods. Ensemble methods, which improve performance by producing better predictions than a single model, are further used.
Results: The accuracy and runtimes are captured for the original datasets and for those obtained after using feature selection and ensemble techniques; a comparison is shown in each case. The highest accuracy obtained was around 78% for Dataset 1, after employing the ensemble technique Max Voting, and around 93% for Dataset 2, after using the ensemble techniques Max Voting and Stacking.
Conclusion: Logistic Regression has been shown to be one of the efficient algorithms for building prediction models. This study also shows that, apart from the choice of algorithm, other factors can improve the accuracy and runtime of a model, such as data pre-processing, removal of redundant and null values, normalization, cross-validation, feature selection, and the use of ensemble techniques.
1. Introduction

Diabetes, also known as Diabetes Mellitus, targets many people around the world. According to the International Diabetes Federation, approximately 463 million adults (20–79 years) were living with diabetes in 2019, and it is predicted that by 2045 this number will rise to 700 million. Diabetes prevalence has been rising more rapidly in low- and middle-income countries than in high-income countries. Diabetes is a major cause of blindness, kidney failure, heart attacks, stroke and lower limb amputation [1]. It is also estimated that around 84.1 million Americans who are 18 years or older have prediabetes [2].

There are three types of Diabetes. Type-1 is known as Insulin-Dependent Diabetes Mellitus (IDDM); the reason behind this type is the inability of the body to generate enough insulin, so the patient is required to inject insulin. Type-2, also known as Non-Insulin-Dependent Diabetes Mellitus (NIDDM), is seen when body cells are not able to use insulin properly. Type-3, Gestational Diabetes, increases the blood sugar level in pregnant women [3]; this happens when diabetes is not detected in the early stages. Even though Diabetes is incurable, it can be managed by treatment and medication.

Many healthcare organizations are now using Machine Learning techniques, such as predictive modeling, in healthcare. Additionally, there are complex algorithms at play, identifying processes and patterns invisible to the human eye, which helps researchers discover new medicines and treatment plans. Predictive modeling uses data mining, machine learning, and statistics to identify patterns in data and recognize the chances of outcomes occurring.

This paper focuses on building a predictive model for diabetes to identify whether a certain patient has diabetes; various techniques are then explored to improve accuracy. Logistic Regression is used to develop the main model, and the first dataset used is the PIMA Indian
* Corresponding author.
E-mail addresses: rajenp1@unlv.nevada.edu (P. Rajendra), shahram.latifi@unlv.edu (S. Latifi).
https://doi.org/10.1016/j.cmpbup.2021.100032
Received 9 February 2021; Received in revised form 11 October 2021; Accepted 12 October 2021
Available online 25 October 2021
2666-9900/© 2021 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/4.0/).
Dataset [4]. In this dataset, all patients are females who are at least 21 years old. The paper explains the step-by-step process of the model, from its design to its implementation. The second dataset used is from Vanderbilt [5], which is based on a study of rural African Americans in Virginia. It consists of 16 features and contains both male and female patients.

The rest of the paper is organized as follows: Section 2 discusses the previous work and how this work differs from it. Sections 3 and 4 consist of the model design, methods used and results. Section 5 lists the challenges which are yet to be addressed, and finally Section 6 concludes the overall experiment.

2. Literature review

Machine Learning techniques are becoming more useful in the medical sector. Many researchers have used various Machine Learning and Deep Learning techniques and algorithms to predict diabetes. Aishwarya and Vaidehi [3] used several machine learning algorithms, such as Support Vector Machines, Random Forest Classifier, Decision Tree Classifier, Extra Tree Classifier, the Ada Boost algorithm, Perceptron, the Linear Discriminant Analysis algorithm, Logistic Regression, K-NN, Gaussian Naïve Bayes, the Bagging algorithm and the Gradient Boost Classifier. They used two different datasets, the PIMA Indian and another diabetes dataset, for testing the various models. Logistic Regression gave them an accuracy value of 96%. On the other hand, Tejas and Pramila [6] chose two algorithms, Logistic Regression and SVM, to build a diabetes prediction model. Pre-processing of the data was carried out to obtain better results. They found that SVM performed better, with an accuracy of 79%.

Yuvaraj and Sripreethaa [7] designed a diabetes prediction model using three different Machine Learning algorithms, Random Forest, Decision Tree, and Naïve Bayes, in Hadoop-based clusters. They employed pre-processing techniques on the dataset. The results showed that the highest accuracy rate of 94% was obtained with the Random Forest algorithm. Deepti and Dilip [8] used Decision Tree, SVM, and Naive Bayes algorithms. Ten-fold cross-validation was used to improve performance. The highest accuracy was obtained by Naive Bayes, with an accuracy of 76.30%. Both these papers used the Pima Indian Diabetes dataset.

Both Olaniyi and Adnan [9] and Swapna et al. [10] made use of Deep Learning techniques for diabetes prediction. The former used a Multilayer Feed-Forward Neural Network, with the back-propagation algorithm used for training the model. They also used the PIMA Indian dataset and normalized it before pre-processing, to obtain numerical stability. They obtained 82% accuracy. The latter used a dataset of electrocardiograms on two models, using CNN and CNN-LSTM. The dataset consisted of 142,000 samples and eight attributes. They obtained an accuracy of 93.6% with the CNN model and 95.1% with the CNN-LSTM model, with five-fold cross-validation for both.

All the above studies gave a comparative performance analysis among various machine learning algorithms. Some of them used data pre-processing and cross-validation techniques to improve accuracy, but all of them focused more on the comparison of performance between the various models than on improving a single model. In this paper, we have concentrated on a single model and explored techniques which can not only improve accuracy but also improve execution speed, thus increasing the performance. This paper shows that, in addition to algorithm selection, pre- and post-processing of data play a major role in the overall improvement of the model.

3. Model design

Two main datasets are used in this paper. The first one is the PIMA Indian Dataset, which consists of data from 768 patients, all females of age 21 or older. There are nine features in total. The second dataset is from Vanderbilt, which is based on a study of rural African Americans in Virginia. It consists of 16 features, with 390 data samples covering both male and female patients. Here, Dataset 1 refers to the PIMA Indian Dataset and Dataset 2 refers to the dataset from Vanderbilt. The main algorithm used in the prediction model is Logistic Regression, although a few other machine learning techniques, such as Decision Tree, Support Vector Machines, K-Nearest Neighbors and Naïve Bayes, are used in the ensemble methods to test the improvement over the original performance. We have designed a flowchart of the prediction model, which is shown in Fig. 1; it shows how the implementation will be carried out.

Various methods are explored to improve the performance and execution time. The first technique is feature selection, using two methods: creating new features and then selecting the best, and Univariate Feature Selection [11]. The second technique is using ensemble methods; in this project, two ensemble methods, Max Voting (Majority Voting) and Stacking, are used. All the results are analyzed using the PyCharm IDE with Python 3.6 on the Windows 10 platform.

4. Methods and results

4.1. Dataset selection and feature evaluation

The first steps involve selecting the dataset for the model and evaluating its features. For this paper, the first dataset chosen is the PIMA Indian dataset. There are a total of nine features/variables, among which eight are predictor variables and one is the target variable. The features are as follows:

■ Pregnancies: Number of times the patient was pregnant.
■ Glucose: Plasma glucose concentration over two hours in an oral glucose tolerance test.
■ BloodPressure: Diastolic blood pressure (mm Hg).
■ SkinThickness: Triceps skin fold thickness (mm).
■ Insulin: Two-hour serum insulin (mu U/ml).
■ BMI: Body mass index (weight in kg/(height in m)^2).
■ DiabetesPedigreeFunction/DPF: A function that scores the likelihood of diabetes based on family history.
■ Age: In years.
■ Outcome: Class variable (0 if non-diabetic, 1 if diabetic). This is the target variable.

The second dataset used is the Vanderbilt dataset. It consists of 16 features, out of which one is the target variable, i.e. Diabetes:

■ Patient number: Identifies patients by number.
■ Cholesterol: Total cholesterol.
■ Glucose: Fasting blood sugar.
■ HDL: HDL, or "good", cholesterol.
■ Chol/HDL: Ratio of total cholesterol to good cholesterol; a desirable result is < 5.
■ Age: Age of the patient.
■ Gender: 162 males, 228 females.
■ Height: In inches.
■ Weight: In pounds (lbs).
■ BMI: 703 × weight (lbs) / [height (inches)]^2.
■ Systolic BP: The upper number of the blood pressure reading.
■ Diastolic BP: The lower number of the blood pressure reading.
■ Waist: Measured in inches.
■ Hip: Measured in inches.
■ Waist/hip: This ratio is possibly a stronger risk factor for heart disease than BMI.
■ Diabetes: Yes (60), No (330). This is the target variable.
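The pipeline just described (load a dataset, split it, and fit the Logistic Regression model) can be sketched as follows; the function name, split ratio, and scaling step below are illustrative assumptions, not details taken from the paper:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def train_baseline(df: pd.DataFrame, target: str = "Outcome") -> float:
    """Fit a Logistic Regression model and return its test-set accuracy.

    `target` defaults to the PIMA outcome column; for the Vanderbilt
    dataset the Diabetes column would be used instead.
    """
    X, y = df.drop(columns=[target]), df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0, stratify=y)
    # Normalization (one of the factors named in the conclusion) aids convergence.
    scaler = StandardScaler().fit(X_train)
    model = LogisticRegression(max_iter=1000)
    model.fit(scaler.transform(X_train), y_train)
    return accuracy_score(y_test, model.predict(scaler.transform(X_test)))

# Usage (the CSV path is a placeholder, not a path from the paper):
# print(train_baseline(pd.read_csv("diabetes.csv")))
```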
4.2. Loading data

The data, which is in CSV format, is loaded into a variable. There are 768 data points in Dataset 1 and 390 data points in Dataset 2.

4.3. Data exploration

Data exploration involves getting insights about the data and finding the correlation between the features. Dataset 1 consists of 268 patients with diabetes and the remaining 500 without diabetes, whereas Dataset 2 has 60 diabetic patients and 330 who are not diabetic. The heatmap displayed in Fig. 2 shows the correlation between the features of Dataset 1; lighter colors represent more correlation and darker colors represent less correlation. Fig. 3 shows a bar plot displaying the count of patients with and without diabetes in Dataset 1.
Fig. 2. Correlation values and heat map, showing the correlation between the features for Dataset 1.
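The correlation analysis behind Fig. 2 can be sketched with pandas and matplotlib; the function name and output path below are hypothetical:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import pandas as pd

def correlation_heatmap(df: pd.DataFrame, out_path: str = "heatmap.png") -> pd.DataFrame:
    """Save a Fig. 2-style heatmap (lighter = stronger correlation)
    and return the correlation matrix itself."""
    corr = df.corr(numeric_only=True)
    fig, ax = plt.subplots(figsize=(8, 6))
    im = ax.imshow(corr.values, cmap="viridis")
    ax.set_xticks(range(len(corr.columns)))
    ax.set_xticklabels(corr.columns, rotation=90)
    ax.set_yticks(range(len(corr.columns)))
    ax.set_yticklabels(corr.columns)
    fig.colorbar(im)
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)
    return corr
```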
Fig. 4. Heat map showing the correlation between all the old and new features.
The chi-square statistic used for scoring the features is

χ²_c = Σ (O_i − E_i)² / E_i

where:

c = degrees of freedom
O = observed value(s)
E = expected value(s)

The 'SelectKBest' and 'chi2' methods from the sklearn library are used to achieve univariate feature selection in Python. The eight highest-scoring features are selected, and the test is carried out. The selected features are: Cholesterol, Glucose, HDL Chol, Chol/HDL ratio, Age, Weight, Systolic BP and Waist. For this method, the second dataset, from Vanderbilt, is used. Training and testing using Logistic Regression are carried out with the selected features. The classification report after Univariate Feature Selection can be observed in Table 4. In this method, there was a slight increase in the accuracy but a significant decrease in the runtime.
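A minimal sketch of this step using sklearn's 'SelectKBest' and 'chi2' (only those two names come from the text; the wrapper function and its defaults are assumptions):

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

def top_k_features(df: pd.DataFrame, target: str = "Diabetes", k: int = 8) -> list:
    """Score each feature against the target with the chi-square test
    and return the names of the k highest-scoring features. chi2
    requires non-negative inputs, which holds for these measurements."""
    X, y = df.drop(columns=[target]), df[target]
    selector = SelectKBest(score_func=chi2, k=k).fit(X, y)
    return list(X.columns[selector.get_support()])
```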
Table 6
Accuracy for individual and ensemble models.
Logistic Regression | Decision Tree | Support Vector Machine | k-Nearest Neighbor | Naïve Bayes | Ensemble Model
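A Max Voting (hard-voting) ensemble over the five models of Table 6 can be sketched with sklearn's VotingClassifier; the hyperparameters shown are illustrative defaults, not the paper's settings:

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def max_voting_model() -> VotingClassifier:
    """Majority vote over the predicted class labels of five classifiers."""
    return VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("dt", DecisionTreeClassifier(random_state=0)),
            ("svm", SVC()),
            ("knn", KNeighborsClassifier()),
            ("nb", GaussianNB()),
        ],
        voting="hard",  # each model casts one vote per sample
    )
```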
Table 7
Comparison of Accuracy with and without Stacking.

            Base Models     Accuracy after k-fold   Accuracy of final Logistic Regression model after k-fold
Dataset 1   Decision Tree   0.724                   0.7635
            Naïve Bayes     0.748
            k-NN            0.756
Dataset 2   Decision Tree   0.923                   0.9304
            Naïve Bayes     0.930
            k-NN            0.923
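The stacking setup of Table 7 (Decision Tree, Naïve Bayes and k-NN as base models, with Logistic Regression as the final model trained on k-fold predictions) can be sketched with sklearn's StackingClassifier; the fold count and other parameters are assumptions:

```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def stacking_model(n_folds: int = 5) -> StackingClassifier:
    """The base models' out-of-fold predictions (k-fold CV) become the
    training features for the final Logistic Regression model."""
    return StackingClassifier(
        estimators=[
            ("dt", DecisionTreeClassifier(random_state=0)),
            ("nb", GaussianNB()),
            ("knn", KNeighborsClassifier()),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
        cv=n_folds,
    )
```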
Table 8
Summary of all the techniques.
Without Stacking | With Stacking
6. Conclusion
References

[2] National Institute of Diabetes and Kidney Diseases, 2021, https://www.niddk.nih.gov/health-information/diabetes.
[3] A. Mujumdar, V. Vaidehi, Diabetes prediction using machine learning algorithms, in: International Conference on Recent Trends in Advanced Computing (ICRTAC), 2019.
[4] PIMA Indian Dataset, 2021, https://www.kaggle.com/uciml/pima-indians-diabetes-database.
[5] Dataset from Vanderbilt, 2021, https://data.world/informatics-edu/diabetes-prediction.
[6] T.N. Joshi, P.M. Chawan, Logistic regression and SVM based diabetes prediction system, Int. J. Technol. Res. Eng. 11 (5) (2018), July.
[7] N. Yuvaraj, K.R. SriPreethaa, Diabetes prediction in healthcare systems using machine learning algorithms on Hadoop cluster, Cluster Comput. 22 (2017) 1–9.
[8] D. Sisodia, D.S. Sisodia, Prediction of diabetes using classification algorithms, Procedia Comput. Sci. 132 (2018) 1578–1585.
[9] E.O. Olaniyi, K. Adnan, Onset diabetes diagnosis using artificial neural network, Int. J. Sci. Eng. Res. 5 (2014) 754–759.
[10] G. Swapna, K.P. Soman, R. Vinayakumar, Automated detection of diabetes using CNN and CNN-LSTM network and heart rate signals, Procedia Comput. Sci. 132 (2018) 1253–1262.
[11] M.R.H. Subho, M.R. Chowdhury, D. Chaki, S. Islam, M.M. Rahman, A univariate feature selection approach for finding key factors of restaurant business, Conference Paper, June 2019.
[12] Mayo Clinic, 2021, https://www.mayoclinic.org/diseases-conditions/diabetes/diagnosis-treatment/drc-20371451#:~:text=A%20blood%20sugar%20level%20less,mmol%2FL)%20indicates%20prediabetes.
[13] American Diabetes Association, Diabetes Care, https://care.diabetesjournals.org/content/30/6/1562#:~:text=Individuals%20with%20BMI%20%E2%89%A530,life-years%20lost%20to%20diabe.
[14] C. Lv, C. Chen, Qi Chen, H. Zhai, Li Zhao, Y. Guo, N. Wang, Multiple pregnancies and the risk of diabetes mellitus in postmenopausal women, National Library of Medicine, September 2019.
[15] Q. Zou, K. Qu, Y. Luo, D. Yin, Y. Ju, H. Tang, Predicting diabetes mellitus with machine learning techniques, 2018.
[16] Feature selection, Scikit-learn, 2021, https://scikit-learn.org/stable/modules/feature_selection.html.
[17] Chi-Square Test for Feature Selection – Mathematical Explanation, GeeksforGeeks, 2021, https://www.geeksforgeeks.org/chi-square-test-for-feature-selection-mathematical-explanation/?ref=lbp.
[18] Vadim Smolyakov, Ensemble learning to improve machine learning results, 2021, https://blog.statsbot.co/ensemble-learning-d1dcd548e936.
[19] L.G. Kabari, U.C. Onwuka, Comparison of bagging and voting ensemble machine learning algorithm as a classifier, Int. J. Adv. Res. Comput. Sci. Softw. Eng., March 2019.
[20] Gareth James, Majority Vote Classifiers: Theory and Applications, PhD Thesis Dissertation, Stanford University, 1998.
[21] S. Shankaracharya, Diabetes risk prediction using machine learning: prospect and challenges, J. Bioinform. Proteom. Imaging Anal. (2017).
[22] G.T. Vu, B.X. Tran, R.S. McIntyre, H.Q. Pham, H.T. Phan, G.H. Ha, K.K. Gwee, C.A. Latkin, R.C.M. Ho, C.S.H. Ho, Modeling the research landscapes of artificial intelligence applications in diabetes, Int. J. Environ. Res. Public Health (2020).