Report Diabetics
ARTICLE INFO

Keywords:
Diabetes
Ensemble methods
Feature selection
Logistic regression
Prediction model

ABSTRACT

Background: Logistic regression is a classification model in machine learning, extensively used in clinical analysis. It uses probabilistic estimations which help in understanding the relationship between the dependent variable and one or more independent variables. Diabetes is one of the most common diseases around the world; when it is detected early, the progression of the disease may be prevented and other complications avoided. In this work, we design a prediction model that predicts whether a patient has diabetes, based on certain diagnostic measurements included in the dataset, and explore various techniques to boost the performance and accuracy.
Methods: Logistic Regression is the main algorithm used in this paper, and the analysis is carried out using a Python IDE. The experiment mainly uses two datasets: the PIMA Indians Diabetes dataset, which is originally from the National Institute of Diabetes and Digestive and Kidney Diseases, and a dataset from Vanderbilt, which is based on a study of rural African Americans in Virginia. Feature selection is carried out using two different methods. Ensemble methods, which improve performance by producing better predictions than a single model, are further used.
Results: The accuracy and runtimes are captured for the original datasets and for those obtained after using feature selection and ensemble techniques; a comparison is shown in each case. The highest accuracy obtained was around 78% for Dataset 1, after employing the ensemble technique Max Voting, and around 93% for Dataset 2, after using the ensemble techniques Max Voting and Stacking.
Conclusion: Logistic Regression has been shown to be one of the efficient algorithms for building prediction models. This study also shows that, apart from the choice of algorithm, other factors can improve the accuracy and runtime of a model, such as data pre-processing, removal of redundant and null values, normalization, cross-validation, feature selection, and the use of ensemble techniques.
1. Introduction

Diabetes, also known as Diabetes Mellitus, targets many people around the world. According to the International Diabetes Federation, approximately 463 million adults (20–79 years) were living with diabetes in 2019, and it is predicted that by 2045 this number will rise to 700 million. Diabetes prevalence has been rising more rapidly in low- and middle-income countries than in high-income countries. Diabetes is a major cause of blindness, kidney failure, heart attacks, stroke and lower limb amputation [1]. It is also estimated that around 84.1 million Americans who are 18 years or older have prediabetes [2].

There are three types of Diabetes. Type-1 is known as Insulin-Dependent Diabetes Mellitus (IDDM); the reason behind this type is the inability of the body to generate enough insulin, so the patient is required to inject insulin. Type-2, also known as Non-Insulin-Dependent Diabetes Mellitus (NIDDM), is seen when body cells are not able to use insulin properly. Type-3, Gestational Diabetes, increases the blood sugar level in pregnant women [3]; this happens when diabetes is not detected in the early stages. Even though Diabetes is incurable, it can be managed by treatment and medication.

Many healthcare organizations are now using Machine Learning techniques, such as predictive modeling, in healthcare. Additionally, there are complex algorithms at play, identifying processes and patterns invisible to the human eye, which helps researchers discover new medicines and treatment plans. Predictive modeling uses data mining, machine learning, and statistics to identify patterns in data and recognize the chances of outcomes occurring.

This paper focuses on building a predictive model for diabetes to identify whether a certain patient has diabetes; various techniques are then explored to improve accuracy. Logistic Regression is used to develop the main model, and the first dataset used is the PIMA Indian
* Corresponding author.
E-mail addresses: rajenp1@unlv.nevada.edu (P. Rajendra), shahram.latifi@unlv.edu (S. Latifi).
https://doi.org/10.1016/j.cmpbup.2021.100032
Received 9 February 2021; Received in revised form 11 October 2021; Accepted 12 October 2021
Available online 25 October 2021
2666-9900/© 2021 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/4.0/).
Dataset [4]. In this dataset, all patients are females who are at least 21 years old. The paper explains the step-by-step process of the model, from its design to its implementation. The second dataset used is from Vanderbilt [5], which is based on a study of rural African Americans in Virginia. It consists of 16 features and contains both male and female patients.

The rest of the paper is organized as follows: Section 2 discusses the previous work and how this work differs from it. Sections 3 and 4 consist of the model design, methods used and results. Section 5 lists the challenges which are yet to be addressed, and finally Section 6 concludes the overall experiment.

2. Literature review

Machine Learning techniques are becoming more useful in the medical sector. Many researchers have used various Machine Learning and Deep Learning techniques and algorithms to predict diabetes. Aishwarya and Vaidehi [3] used several machine learning algorithms, such as Support Vector Machines, Random Forest Classifier, Decision Tree Classifier, Extra Tree Classifier, the Ada Boost algorithm, Perceptron, the Linear Discriminant Analysis algorithm, Logistic Regression, K-NN, Gaussian Naïve Bayes, the Bagging algorithm and the Gradient Boost Classifier. They used two different datasets, the PIMA Indian and another diabetes dataset, for testing the various models. Logistic Regression gave them an accuracy value of 96%. On the other hand, Tejas and Pramila [6] chose two algorithms, Logistic Regression and SVM, to build a diabetes prediction model. Pre-processing of the data was carried out to obtain better results. They found that SVM performed better, with an accuracy of 79%.

Yuvaraj and Sripreethaa [7] designed a diabetes prediction model using three different Machine Learning algorithms, Random Forest, Decision Tree, and Naïve Bayes, in Hadoop-based clusters. They employed pre-processing techniques on the dataset. The results showed that the highest accuracy rate of 94% was obtained with the Random Forest algorithm. Deepti and Dilip [8] used Decision Tree, SVM, and Naive Bayes algorithms. Ten-fold cross-validation was used to improve performance. The highest accuracy was obtained by Naive Bayes, with an accuracy of 76.30%. Both these papers used the Pima Indian Diabetes dataset.

Both Olaniyi and Adnan [9] and Swapna et al. [10] made use of Deep Learning techniques for diabetes prediction. The former used a Multilayer Feed-Forward Neural Network, with the back-propagation algorithm used for training the model. They also used the PIMA Indian dataset and normalized it before pre-processing, to obtain numerical stability. They obtained 82% accuracy. The latter used a dataset of electrocardiograms on two models, using CNN and CNN-LSTM. The dataset consisted of 142,000 samples and eight attributes. They obtained an accuracy of 93.6% with the CNN model and 95.1% with the CNN-LSTM model, with five-fold cross-validation for both.

All the above studies gave a comparative performance analysis among various machine learning algorithms. Some of them used data pre-processing and cross-validation techniques to improve accuracy, but all of them focused more on the comparison of performance between the various models than on improving a single model. In this paper, we have concentrated on a single model and explored techniques which can not only improve accuracy but also improve execution speed, thus increasing the performance. This paper shows that, in addition to algorithm selection, pre- and post-processing of data play a major role in the overall improvement of the model.

3. Model design

Two main datasets are used in this paper. The first one is the PIMA Indian Dataset, which consists of data from 768 patients, all females of age 21 or older. There are nine features in total. The second dataset is from Vanderbilt, which is based on a study of rural African Americans in Virginia. It consists of 16 features, with 390 data samples covering both male and female patients. Here, Dataset 1 refers to the PIMA Indian Dataset and Dataset 2 refers to the dataset from Vanderbilt. The main algorithm used in the prediction model is Logistic Regression, although a few other machine learning techniques, such as Decision Tree, Support Vector Machines, K-Nearest Neighbors and Naïve Bayes, are used in the ensemble methods to test the improvement over the original performance. We have designed a flowchart of the prediction model, which is shown in Fig. 1; it shows how the implementation will be carried out.

Various methods are explored to improve the performance and execution time. The first technique is feature selection, using two methods: creating new features and then selecting the best, and Univariate Feature Selection [11]. The second technique is using ensemble methods; in this project, two ensemble methods, Max Voting (Majority Voting) and Stacking, are used. All the results are analyzed using the PyCharm IDE with Python 3.6 on the Windows 10 platform.

4. Methods and results

4.1. Dataset selection and feature evaluation

The first steps involve selecting the dataset for the model and evaluating its features. For this paper, the first dataset chosen is the PIMA Indian dataset. There are a total of nine features/variables, among which eight are predictor variables and one is the target variable. The features are as follows:

■ Pregnancies: Number of times the patient was pregnant.
■ Glucose: Plasma glucose concentration over two hours in an oral glucose tolerance test.
■ BloodPressure: Diastolic blood pressure (mm Hg).
■ SkinThickness: Triceps skin fold thickness (mm).
■ Insulin: Two-hour serum insulin (mu U/ml).
■ BMI: Body mass index (weight in kg/(height in m)^2).
■ DiabetesPedigreeFunction/DPF: A function that scores the likelihood of diabetes based on family history.
■ Age: In years.
■ Outcome: Class variable (0 if non-diabetic, 1 if diabetic). This is the target variable.

The second dataset used is the Vanderbilt dataset. It consists of 16 features, out of which one is the target variable, i.e. Diabetes:

■ Patient number: Identifies patients by number.
■ Cholesterol: Total cholesterol.
■ Glucose: Fasting blood sugar.
■ HDL: HDL, or "good", cholesterol.
■ Chol/HDL: Ratio of total cholesterol to good cholesterol; a desirable result is < 5.
■ Age: Age of the patient.
■ Gender: 162 males, 228 females.
■ Height: In inches.
■ Weight: In pounds (lbs).
■ BMI: 703 × weight (lbs) / [height (inches)]^2.
■ Systolic BP: The upper number of the blood pressure reading.
■ Diastolic BP: The lower number of the blood pressure reading.
■ Waist: Measured in inches.
■ Hip: Measured in inches.
■ Waist/hip: This ratio is possibly a stronger risk factor for heart disease than BMI.
■ Diabetes: Yes (60), No (330). This is the target variable.
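The pipeline just described (load a dataset, split it, and fit the Logistic Regression model) can be sketched as follows; the function name, split ratio, and scaling step below are illustrative assumptions, not details taken from the paper:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def train_baseline(df: pd.DataFrame, target: str = "Outcome") -> float:
    """Fit a Logistic Regression model and return its test-set accuracy.

    `target` defaults to the PIMA outcome column; for the Vanderbilt
    dataset the Diabetes column would be used instead.
    """
    X, y = df.drop(columns=[target]), df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0, stratify=y)
    # Normalization (one of the factors named in the conclusion) aids convergence.
    scaler = StandardScaler().fit(X_train)
    model = LogisticRegression(max_iter=1000)
    model.fit(scaler.transform(X_train), y_train)
    return accuracy_score(y_test, model.predict(scaler.transform(X_test)))

# Usage (the CSV path is a placeholder, not a path from the paper):
# print(train_baseline(pd.read_csv("diabetes.csv")))
```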
4.2. Loading data

The data, which is in CSV format, is loaded into a variable. There are 768 data points in Dataset 1 and 390 data points in Dataset 2.

4.3. Data exploration

Data exploration involves getting insights about the data and finding the correlation between the features. Dataset 1 consists of 268 patients with diabetes and the remaining 500 without diabetes, whereas Dataset 2 has 60 diabetic patients and 330 who are not diabetic. The heatmap displayed in Fig. 2 shows the correlation between the features of Dataset 1; lighter colors represent more correlation and darker colors represent less correlation. Fig. 3 shows a bar plot displaying the count of patients with and without diabetes in Dataset 1.
Fig. 2. Correlation values and heat map, showing the correlation between the features for Dataset 1.
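The correlation analysis behind Fig. 2 can be sketched with pandas and matplotlib; the function name and output path below are hypothetical:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import pandas as pd

def correlation_heatmap(df: pd.DataFrame, out_path: str = "heatmap.png") -> pd.DataFrame:
    """Save a Fig. 2-style heatmap (lighter = stronger correlation)
    and return the correlation matrix itself."""
    corr = df.corr(numeric_only=True)
    fig, ax = plt.subplots(figsize=(8, 6))
    im = ax.imshow(corr.values, cmap="viridis")
    ax.set_xticks(range(len(corr.columns)))
    ax.set_xticklabels(corr.columns, rotation=90)
    ax.set_yticks(range(len(corr.columns)))
    ax.set_yticklabels(corr.columns)
    fig.colorbar(im)
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)
    return corr
```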
Fig. 4. Heat map showing the correlation between all the old and new features.
The chi-square statistic used for scoring the features is

χ²_c = Σ (O_i − E_i)² / E_i

where:

c = degrees of freedom
O = observed value(s)
E = expected value(s)

The 'SelectKBest' and 'chi2' methods from the sklearn library are used to achieve univariate feature selection in Python. The eight highest-scoring features are selected, and the test is carried out. The selected features are: Cholesterol, Glucose, HDL Chol, Chol/HDL ratio, Age, Weight, Systolic BP and Waist. For this method, the second dataset, from Vanderbilt, is used. Training and testing using Logistic Regression are carried out with the selected features. The classification report after Univariate Feature Selection can be observed in Table 4. In this method, there was a slight increase in the accuracy but a significant decrease in the runtime.
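A minimal sketch of this step using sklearn's 'SelectKBest' and 'chi2' (only those two names come from the text; the wrapper function and its defaults are assumptions):

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

def top_k_features(df: pd.DataFrame, target: str = "Diabetes", k: int = 8) -> list:
    """Score each feature against the target with the chi-square test
    and return the names of the k highest-scoring features. chi2
    requires non-negative inputs, which holds for these measurements."""
    X, y = df.drop(columns=[target]), df[target]
    selector = SelectKBest(score_func=chi2, k=k).fit(X, y)
    return list(X.columns[selector.get_support()])
```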
Table 6
Accuracy for individual and ensemble models.
Logistic Regression | Decision Tree | Support Vector Machine | k-Nearest Neighbor | Naïve Bayes | Ensemble Model
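A Max Voting (hard-voting) ensemble over the five models of Table 6 can be sketched with sklearn's VotingClassifier; the hyperparameters shown are illustrative defaults, not the paper's settings:

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def max_voting_model() -> VotingClassifier:
    """Majority vote over the predicted class labels of five classifiers."""
    return VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("dt", DecisionTreeClassifier(random_state=0)),
            ("svm", SVC()),
            ("knn", KNeighborsClassifier()),
            ("nb", GaussianNB()),
        ],
        voting="hard",  # each model casts one vote per sample
    )
```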
Table 7
Comparison of Accuracy with and without Stacking.

            Base Models     Accuracy after k-fold   Accuracy of final Logistic Regression model after k-fold
Dataset 1   Decision Tree   0.724                   0.7635
            Naïve Bayes     0.748
            k-NN            0.756
Dataset 2   Decision Tree   0.923                   0.9304
            Naïve Bayes     0.930
            k-NN            0.923
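The stacking setup of Table 7 (Decision Tree, Naïve Bayes and k-NN as base models, with Logistic Regression as the final model trained on k-fold predictions) can be sketched with sklearn's StackingClassifier; the fold count and other parameters are assumptions:

```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def stacking_model(n_folds: int = 5) -> StackingClassifier:
    """The base models' out-of-fold predictions (k-fold CV) become the
    training features for the final Logistic Regression model."""
    return StackingClassifier(
        estimators=[
            ("dt", DecisionTreeClassifier(random_state=0)),
            ("nb", GaussianNB()),
            ("knn", KNeighborsClassifier()),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
        cv=n_folds,
    )
```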
Table 8
Summary of all the techniques.
Without Stacking | With Stacking
6. Conclusion
References

[2] National Institute of Diabetes and Kidney Diseases, 2021, https://www.niddk.nih.gov/health-information/diabetes.
[3] A. Mujumdar, V. Vaidehi, Diabetes prediction using machine learning algorithms, in: International Conference on Recent Trends in Advanced Computing (ICRTAC), 2019.
[4] PIMA Indian Dataset, 2021, https://www.kaggle.com/uciml/pima-indians-diabetes-database.
[5] Dataset from Vanderbilt, 2021, https://data.world/informatics-edu/diabetes-prediction.
[6] T.N. Joshi, P.M. Chawan, Logistic regression and SVM based diabetes prediction system, Int. J. Technol. Res. Eng. 11 (5) (2018), July.
[7] N. Yuvaraj, K.R. SriPreethaa, Diabetes prediction in healthcare systems using machine learning algorithms on Hadoop cluster, Cluster Comput. 22 (2017) 1–9.
[8] D. Sisodia, D.S. Sisodia, Prediction of diabetes using classification algorithms, Procedia Comput. Sci. 132 (2018) 1578–1585.
[9] E.O. Olaniyi, K. Adnan, Onset diabetes diagnosis using artificial neural network, Int. J. Sci. Eng. Res. 5 (2014) 754–759.
[10] G. Swapna, K.P. Soman, R. Vinayakumar, Automated detection of diabetes using CNN and CNN-LSTM network and heart rate signals, Procedia Comput. Sci. 132 (2018) 1253–1262.
[11] M.R.H. Subho, M.R. Chowdhury, D. Chaki, S. Islam, M.M. Rahman, A univariate feature selection approach for finding key factors of restaurant business, Conference Paper, June 2019.
[12] Mayo Clinic, 2021, https://www.mayoclinic.org/diseases-conditions/diabetes/diagnosis-treatment/drc-20371451#:~:text=A%20blood%20sugar%20level%20less,mmol%2FL)%20indicates%20prediabetes.
[13] American Diabetes Association, Diabetes Care, https://care.diabetesjournals.org/content/30/6/1562#:~:text=Individuals%20with%20BMI%20%E2%89%A530,life-years%20lost%20to%20diabe.
[14] C. Lv, C. Chen, Qi Chen, H. Zhai, Li Zhao, Y. Guo, N. Wang, Multiple pregnancies and the risk of diabetes mellitus in postmenopausal women, National Library of Medicine, September 2019.
[15] Q. Zou, K. Qu, Y. Luo, D. Yin, Y. Ju, H. Tang, Predicting diabetes mellitus with machine learning techniques, 2018.
[16] Feature selection, Scikit-learn, 2021, https://scikit-learn.org/stable/modules/feature_selection.html.
[17] Chi-Square Test for Feature Selection – Mathematical Explanation, GeeksforGeeks, 2021, https://www.geeksforgeeks.org/chi-square-test-for-feature-selection-mathematical-explanation/?ref=lbp.
[18] Vadim Smolyakov, Ensemble learning to improve machine learning results, 2021, https://blog.statsbot.co/ensemble-learning-d1dcd548e936.
[19] L.G. Kabari, U.C. Onwuka, Comparison of bagging and voting ensemble machine learning algorithm as a classifier, Int. J. Adv. Res. Comput. Sci. Softw. Eng., March 2019.
[20] Gareth James, Majority Vote Classifiers: Theory and Applications, PhD Thesis Dissertation, Stanford University, 1998.
[21] S. Shankaracharya, Diabetes risk prediction using machine learning: prospect and challenges, J. Bioinform. Proteom. Imaging Anal. (2017).
[22] G.T. Vu, B.X. Tran, R.S. McIntyre, H.Q. Pham, H.T. Phan, G.H. Ha, K.K. Gwee, C.A. Latkin, R.C.M. Ho, C.S.H. Ho, Modeling the research landscapes of artificial intelligence applications in diabetes, Int. J. Environ. Res. Public Health (2020).