
Predicting Diabetes Using Logistic Regression and Support Vector Machine


Anuj Kumar
Department of Electronics and Communication Engineering
Indian Institute of Technology Roorkee
Enrollment Number: 22116014
Team member: Gulshan (22116034)

Abstract—Diabetes is a chronic condition affecting millions worldwide, with significant implications for health and quality of life. Early detection and diagnosis are critical in preventing severe complications. This project aims to predict the likelihood of diabetes in individuals based on clinical data, such as glucose levels, BMI, and age.
The analysis utilizes two machine learning models, Logistic Regression and Support Vector Machine (SVM), to build predictive classifiers. The dataset comprises 768 samples with eight clinical features and a binary outcome indicating the presence or absence of diabetes.
The methodology includes data preprocessing, feature scaling, and splitting the data into training and testing subsets. The performance of the models is evaluated using accuracy, precision, recall, and F1-score.
Key findings indicate that both models achieved satisfactory performance, with the SVM classifier slightly outperforming Logistic Regression in terms of accuracy and precision. These results demonstrate the potential of machine learning in aiding healthcare professionals in the early detection of diabetes, contributing to more effective prevention and treatment strategies.
Conclusions drawn from this study emphasize the importance of integrating machine learning tools into healthcare diagnostics to improve patient outcomes.

I. INTRODUCTION

a) Problem Statement: Epidemiological studies show that early detection of diabetes can significantly reduce associated complications, such as cardiovascular diseases, kidney damage, and vision impairment. However, timely diagnosis remains a challenge in resource-constrained settings. This project aims to predict diabetes based on clinical data, enabling early intervention and potentially improving patient outcomes.

b) Motivation: Diabetes has become a global health crisis, with the number of affected individuals rising rapidly. According to recent statistics, approximately 537 million adults were living with diabetes in 2021, a number projected to increase in the coming decades. Early diagnosis plays a pivotal role in reducing the burden of the disease by facilitating timely treatment and lifestyle modifications.
Machine learning offers a promising avenue for predicting diseases like diabetes by leveraging patterns in clinical data. This study explores how predictive modeling can contribute to healthcare, empowering practitioners with reliable diagnostic tools and enhancing patient care.

c) Objectives: The primary objectives of this project are:
1) Analyze the dataset: Identify trends, patterns, and any data quality issues in the clinical dataset.
2) Build predictive models: Utilize Logistic Regression and SVM algorithms to predict the likelihood of diabetes.
3) Compare models: Evaluate the performance of both models based on accuracy and other key metrics to identify the most effective predictive approach.

II. DATA DESCRIPTION

a) Data Source: The dataset used in this analysis is named diabetes.csv and contains clinical measurements of 768 individuals. It was obtained for educational purposes and is widely used in machine learning applications for healthcare diagnostics; the source repository is cited in Section VI.

b) Snapshot of Data: A sample of the dataset is shown below, displaying the first few rows:

Fig. 1. Snapshot of Dataset

c) Data Information: The dataset consists of:
• Number of samples: 768
• Number of features: 8 predictors (e.g., Glucose, BMI, Age) and 1 outcome variable (Outcome).
• Outcome variable:
  – 0: Non-diabetic
  – 1: Diabetic
The predictors provide clinical measurements essential for assessing diabetes risk, such as glucose levels, blood pressure, BMI, and age.
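The dataset description above can be double-checked with a short pandas sketch. The small stand-in frame below mirrors the nine-column schema so the snippet is self-contained; in the project itself the full 768 rows would be loaded with pd.read_csv("diabetes.csv"), and the sample values shown are illustrative only:

```python
import pandas as pd

# In the project: df = pd.read_csv("diabetes.csv")
# A tiny stand-in frame with the same nine columns keeps this sketch runnable.
df = pd.DataFrame({
    "Pregnancies": [6, 1, 8],
    "Glucose": [148, 85, 183],
    "BloodPressure": [72, 66, 64],
    "SkinThickness": [35, 29, 0],
    "Insulin": [0, 0, 0],
    "BMI": [33.6, 26.6, 23.3],
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672],
    "Age": [50, 31, 32],
    "Outcome": [1, 0, 1],
})

print(df.shape)                      # (rows, columns): 8 predictors + Outcome
print(df.isnull().sum().sum())       # total missing values across the frame
print(df["Outcome"].value_counts())  # class balance: 0 = non-diabetic, 1 = diabetic
```

The same three checks (shape, missing values, class balance) are the first things worth confirming on the real file before any modeling.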
d) Data Preprocessing: To ensure the dataset is ready for analysis, the following preprocessing steps were performed:
1) Checking for Missing Values: A heatmap was generated to visualize missing values in the dataset. No missing values were identified, indicating the dataset is complete.
2) Converting Outcome to a Categorical Variable: The target variable Outcome was converted to a categorical data type to facilitate classification tasks.
3) Feature Scaling: All numerical features were scaled using the StandardScaler to standardize the data. This ensures that features with larger magnitudes, such as Glucose, do not dominate the model's learning process.

III. METHODOLOGY

A. Logistic Regression
Logistic Regression models the probability that an observation belongs to a particular class (e.g., Outcome = 1 for diabetic). The logistic model is based on the log-odds function.
Assumptions:
1) The relationship between predictors and the log-odds of the outcome is linear.
2) The dataset has no multicollinearity among predictors.
3) The predictors are independent of each other.
Why it is suitable for this dataset: The dataset contains a binary outcome (Outcome), making Logistic Regression a natural choice for initial modeling. Additionally, the clinical features are continuous or ordinal, aligning well with the assumptions of Logistic Regression.
Mathematical Model: The log-odds of the probability P(y = 1|X) is modeled as a linear function of the input features X:

log( P(y = 1|X) / (1 − P(y = 1|X)) ) = β0 + β1 X1 + β2 X2 + · · · + βn Xn

Where:
• P(y = 1|X) is the probability that the outcome is 1 (i.e., the patient has diabetes).
• X1, X2, . . . , Xn are the features (predictors) such as Glucose, BMI, etc.
• β0 is the intercept and β1, . . . , βn are the coefficients (weights) that the model learns.
The sigmoid function is applied to the log-odds to obtain a probability:

P(y = 1|X) = 1 / (1 + e^−(β0 + β1 X1 + · · · + βn Xn))

Predicted Class:
• If P(y = 1|X) ≥ 0.5, predict the class as 1 (diabetic).
• Otherwise, predict the class as 0 (non-diabetic).

B. Support Vector Machine (SVM)
The SVM algorithm aims to find the optimal hyperplane that maximizes the margin between the two classes. The linear decision boundary in SVM can be represented as:

w · X + b = 0

Where:
• w is the weight vector perpendicular to the decision boundary (hyperplane).
• b is the bias term (offset from the origin).
• X is the input feature vector.
The goal of SVM is to find w and b such that the margin between the classes is maximized. The margin is defined as the distance between the closest points of the two classes (support vectors) and the hyperplane. The optimization problem is formulated as:

maximize 1 / ||w||

Subject to:

yi (w · Xi + b) ≥ 1, for all training samples i = 1, . . . , N

Where:
• yi ∈ {−1, 1} is the class label for the i-th training sample (i.e., diabetic or non-diabetic).
• Xi is the feature vector for the i-th sample.
• w and b are the parameters to be learned.
This is a quadratic optimization problem that can be solved using methods like quadratic programming.

C. Model Building
The following steps were taken to build and evaluate the models:
Data Splitting: The dataset D = {(X1, y1), (X2, y2), . . . , (XN, yN)} is split into training and testing sets:

D_train, D_test = train_test_split(D, test_size = 0.30)

Where:
• D_train is the training set (used to train both models).
• D_test is the testing set (used to evaluate both models).
Model Training (Logistic Regression and SVM):
• Logistic Regression: The model learns the coefficients β0, β1, . . . , βn by minimizing the logistic loss function:

L = −(1/N) Σ_{i=1}^{N} [ yi log(P(yi|Xi)) + (1 − yi) log(1 − P(yi|Xi)) ]

• Support Vector Machine (SVM): The model learns the weight vector w and bias b by solving the following optimization problem:

min_{w,b} (1/2) ||w||²
Subject to the constraints for all samples:

yi (w · Xi + b) ≥ 1

Model Evaluation: The performance of the trained models is evaluated using metrics like accuracy, precision, recall, and F1-score. Let y_true be the true labels and y_pred be the predicted labels for the test set. The metrics are calculated as follows:
• Accuracy:

Accuracy = Number of correct predictions / Total predictions

• Precision:

Precision = True Positives / (True Positives + False Positives)

• Recall:

Recall = True Positives / (True Positives + False Negatives)

• F1-Score:

F1-score = 2 × (Precision × Recall) / (Precision + Recall)

• Confusion Matrix:

Confusion Matrix = [ TP  FP ; FN  TN ]

Where:
– TP = True Positives (correctly predicted diabetics)
– FP = False Positives (incorrectly predicted diabetics)
– FN = False Negatives (incorrectly predicted non-diabetics)
– TN = True Negatives (correctly predicted non-diabetics)

Classification Report: The classification report for the Logistic Regression model includes key metrics such as precision, recall, and F1-score for both classes (diabetic and non-diabetic).

D. SVM Results
Confusion Matrix Heatmap: The confusion matrix for the Support Vector Machine (SVM) model is shown in the heatmap below:

Fig. 3. Confusion Matrix for SVM

Classification Report: The classification report for the SVM model is as follows:

Fig. 4. Accuracy of SVM

Accuracy Score: The accuracy score of the SVM model is:

Accuracy (SVM) = 0.75
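As a quick sketch, the metrics defined above can be computed with scikit-learn. The labels below are illustrative stand-ins, not the project's actual test split:

```python
# Illustrative computation of the evaluation metrics defined above.
# y_true / y_pred are stand-in labels, not the project's actual test split.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1-score: ", f1_score(y_true, y_pred))
# Note: scikit-learn orders the confusion matrix as [[TN, FP], [FN, TP]],
# which differs from the [[TP, FP], [FN, TN]] layout written above.
print(confusion_matrix(y_true, y_pred))
```

The layout note matters in practice: reading scikit-learn's matrix with the convention used in this report would swap TP and TN.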

Fig. 2. Accuracy of Logistic Regression

Accuracy Score: The accuracy score of the Logistic Regression model is calculated as:

Accuracy (Logistic Regression) = 0.74

E. Comparison of Models
The table below summarizes the key evaluation metrics (accuracy, precision, recall, and F1-score) for both the Logistic Regression and SVM models:

Metric               Logistic Regression   SVM
Accuracy             0.74                  0.75
Precision (Class 0)  0.80                  0.80
Precision (Class 1)  0.62                  0.64
Recall (Class 0)     0.79                  0.81
Recall (Class 1)     0.62                  0.62
F1-Score (Class 0)   0.80                  0.81
F1-Score (Class 1)   0.62                  0.63

TABLE I: Comparison of Model Metrics: Logistic Regression vs SVM

Insights from the Comparison:
- The SVM model slightly outperforms the Logistic Regression model in terms of accuracy (75% vs. 74%).
- SVM has slightly better recall for class 0 (non-diabetic), at 0.81 compared to 0.79 for Logistic Regression, while recall for class 1 (diabetic) is identical for both models (0.62).
- Both models exhibit comparable precision for class 0 (0.80 for both), while SVM has a slight advantage in precision for class 1 (0.64 vs. 0.62).
- The F1-scores for both models are similar, with SVM having a slight edge for both classes (0.81 for class 0 and 0.63 for class 1, compared to Logistic Regression's 0.80 and 0.62).
Based on these results, while both models perform similarly, SVM slightly outperforms Logistic Regression in accuracy and in precision for class 1 (diabetic). This suggests that SVM might be marginally more suitable for distinguishing between diabetic and non-diabetic cases in this dataset.

IV. DISCUSSION

1) Observations:
• Which model performed better and why? The SVM model performed slightly better than Logistic Regression in terms of accuracy and precision, particularly for class 1 (diabetic). This suggests that SVM might be more effective in distinguishing between diabetic and non-diabetic cases. Its ability to handle high-dimensional data and the margin maximization principle likely contributed to this slight edge.
• Impact of data scaling on SVM performance: Data scaling played a significant role in improving the performance of the SVM model. SVM is sensitive to the scale of the features, so scaling the data ensures that all features contribute equally to the model. Without scaling, the model might give more importance to features with larger values, leading to suboptimal performance.
• Challenges faced during analysis: One of the main challenges was ensuring that the data was appropriately preprocessed, especially checking for missing values and scaling the data correctly. Additionally, tuning the hyperparameters of the models to achieve optimal performance required careful experimentation and testing.

2) Insights:
• Key factors contributing to diabetes: The dataset includes various clinical features, such as glucose levels, blood pressure, and BMI, which are strongly correlated with the likelihood of developing diabetes. Feature importance analysis (e.g., using tree-based models or logistic regression coefficients) could help identify which features have the most influence on diabetes risk. For instance, higher glucose levels and BMI are typically associated with an increased likelihood of diabetes.
• Relevance of findings in healthcare: Early detection and prediction of diabetes are crucial in healthcare, as timely interventions can prevent or delay the onset of severe complications. The findings from this analysis, particularly the performance of the predictive models, highlight the importance of using machine learning models in clinical settings. These models can assist healthcare providers in identifying at-risk patients and targeting preventive measures more effectively. The ability to predict diabetes based on clinical data can lead to more personalized and proactive healthcare management.

V. CONCLUSION

In this analysis, we compared the effectiveness of Logistic Regression and Support Vector Machine (SVM) models in predicting diabetes based on clinical data.
• Both models performed similarly, with SVM showing a slight edge in accuracy and precision, particularly for classifying diabetic patients. This indicates that SVM may be more suitable for distinguishing between diabetic and non-diabetic cases, especially when dealing with clinical datasets like this one.
• Data scaling significantly improved the performance of the SVM model, highlighting the importance of proper preprocessing steps when working with machine learning algorithms.
Recommendations for healthcare professionals:
• The findings suggest that machine learning models like SVM can be valuable tools for predicting diabetes early based on clinical indicators. Healthcare professionals could use such models to identify patients at risk and provide timely interventions to prevent the onset of diabetes and related complications.
• It is recommended to use data preprocessing techniques like scaling and feature selection to enhance model accuracy and effectiveness.
• In clinical practice, these models can be integrated into decision support systems to assist healthcare providers in making informed decisions about patient care and diabetes prevention.

VI. REFERENCES

Dataset Source: Predict Diabetes Database. Available at: https://www.kaggle.com/datasets/whenamancodes/predict-diabities
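As a closing illustration, the end-to-end workflow described in Sections II and III (scaling, a 70/30 split, and both classifiers) can be sketched with scikit-learn. This is a minimal sketch rather than the authors' exact code; a synthetic two-class dataset stands in for diabetes.csv so it is self-contained, and it fits the scaler on the training portion only to avoid leakage:

```python
# Minimal end-to-end sketch: scale features, split 70/30, train Logistic
# Regression and a linear SVM, and report test accuracy.
# A synthetic two-class dataset stands in for diabetes.csv here.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=768, n_features=8, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

# Feature scaling: fit on the training data only, then apply to both splits.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

for name, model in [("Logistic Regression", LogisticRegression()),
                    ("SVM (linear)", SVC(kernel="linear"))]:
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: accuracy = {acc:.2f}")
```

With the real diabetes.csv the same steps apply, though exact scores depend on the data and the random split.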
