
Predicting Diabetes Using Logistic Regression and Support Vector Machine


Anuj Kumar
Department of Electronics and Communication Engineering
Indian Institute of Technology Roorkee
Enrollment Number: 22116014
Team member: Gulshan (22116034)

Abstract—Diabetes is a chronic condition affecting millions worldwide, with significant implications for health and quality of life. Early detection and diagnosis are critical in preventing severe complications. This project aims to predict the likelihood of diabetes in individuals based on clinical data, such as glucose levels, BMI, and age.
The analysis utilizes two machine learning models, Logistic Regression and Support Vector Machine (SVM), to build predictive classifiers. The dataset comprises 768 samples with eight clinical features and a binary outcome indicating the presence or absence of diabetes.
The methodology includes data preprocessing, feature scaling, and splitting the data into training and testing subsets. The performance of the models is evaluated using accuracy, precision, recall, and F1-score.
Key findings indicate that both models achieved satisfactory performance, with the SVM classifier slightly outperforming Logistic Regression in terms of accuracy and precision. These results demonstrate the potential of machine learning in aiding healthcare professionals in the early detection of diabetes, contributing to more effective prevention and treatment strategies.
Conclusions drawn from this study emphasize the importance of integrating machine learning tools into healthcare diagnostics to improve patient outcomes.

I. INTRODUCTION

a) Problem Statement: Epidemiological studies show that early detection of diabetes can significantly reduce associated complications, such as cardiovascular diseases, kidney damage, and vision impairment. However, timely diagnosis remains a challenge in resource-constrained settings. This project aims to predict diabetes based on clinical data, enabling early intervention and potentially improving patient outcomes.

b) Motivation: Diabetes has become a global health crisis, with the number of affected individuals rising rapidly. According to recent statistics, approximately 537 million adults were living with diabetes in 2021, a number projected to increase in the coming decades. Early diagnosis plays a pivotal role in reducing the burden of the disease by facilitating timely treatment and lifestyle modifications.
Machine learning offers a promising avenue for predicting diseases like diabetes by leveraging patterns in clinical data. This study explores how predictive modeling can contribute to healthcare, empowering practitioners with reliable diagnostic tools and enhancing patient care.

c) Objectives: The primary objectives of this project are:
1) Analyze the dataset: Identify trends, patterns, and any data quality issues in the clinical dataset.
2) Build predictive models: Utilize Logistic Regression and SVM algorithms to predict the likelihood of diabetes.
3) Compare models: Evaluate the performance of both models based on accuracy and other key metrics to identify the most effective predictive approach.

II. DATA DESCRIPTION

a) Data Source: The dataset used in this analysis is named diabetes.csv and contains clinical measurements of 768 individuals. It was obtained for educational purposes and is widely used in machine learning applications for healthcare diagnostics; the source repository is cited in Section VI.

b) Snapshot of Data: A sample of the dataset is shown below, displaying the first few rows:

Fig. 1. Snapshot of Dataset

c) Data Information: The dataset consists of:
• Number of samples: 768
• Number of features: 8 predictors (e.g., Glucose, BMI, Age) and 1 outcome variable (Outcome).
• Outcome variable:
  – 0: Non-diabetic
  – 1: Diabetic
The predictors provide clinical measurements essential for assessing diabetes risk, such as glucose levels, blood pressure, BMI, and age.
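The dataset description above can be double-checked with a short pandas sketch. The small stand-in frame below mirrors the nine-column schema so the snippet is self-contained; in the project itself the full 768 rows would be loaded with pd.read_csv("diabetes.csv"), and the sample values shown are illustrative only:

```python
import pandas as pd

# In the project: df = pd.read_csv("diabetes.csv")
# A tiny stand-in frame with the same nine columns keeps this sketch runnable.
df = pd.DataFrame({
    "Pregnancies": [6, 1, 8],
    "Glucose": [148, 85, 183],
    "BloodPressure": [72, 66, 64],
    "SkinThickness": [35, 29, 0],
    "Insulin": [0, 0, 0],
    "BMI": [33.6, 26.6, 23.3],
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672],
    "Age": [50, 31, 32],
    "Outcome": [1, 0, 1],
})

print(df.shape)                      # (rows, columns): 8 predictors + Outcome
print(df.isnull().sum().sum())       # total missing values across the frame
print(df["Outcome"].value_counts())  # class balance: 0 = non-diabetic, 1 = diabetic
```

The same three checks (shape, missing values, class balance) are the first things worth confirming on the real file before any modeling.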
d) Data Preprocessing: To ensure the dataset is ready for analysis, the following preprocessing steps were performed:
1) Checking for Missing Values: A heatmap was generated to visualize missing values in the dataset. No missing values were identified, indicating the dataset is complete.
2) Converting Outcome to a Categorical Variable: The target variable Outcome was converted to a categorical data type to facilitate classification tasks.
3) Feature Scaling: All numerical features were scaled using the StandardScaler to standardize the data. This ensures that features with larger magnitudes, such as Glucose, do not dominate the model's learning process.

III. METHODOLOGY

A. Logistic Regression
Logistic Regression models the probability that an observation belongs to a particular class (e.g., Outcome = 1 for diabetic). The logistic model is based on the log-odds function.
Assumptions:
1) The relationship between predictors and the log-odds of the outcome is linear.
2) The dataset has no multicollinearity among predictors.
3) The predictors are independent of each other.
Why it is suitable for this dataset: The dataset contains a binary outcome (Outcome), making Logistic Regression a natural choice for initial modeling. Additionally, the clinical features are continuous or ordinal, aligning well with the assumptions of Logistic Regression.
Mathematical Model: The log-odds of the probability P(y = 1|X) is modeled as a linear function of the input features X:

log( P(y = 1|X) / (1 − P(y = 1|X)) ) = β0 + β1 X1 + β2 X2 + · · · + βn Xn

Where:
• P(y = 1|X) is the probability that the outcome is 1 (i.e., the patient has diabetes).
• X1, X2, . . . , Xn are the features (predictors) such as Glucose, BMI, etc.
• β0 is the intercept and β1, . . . , βn are the coefficients (weights) that the model learns.
The sigmoid function is applied to the log-odds to obtain a probability:

P(y = 1|X) = 1 / (1 + e^−(β0 + β1 X1 + · · · + βn Xn))

Predicted Class:
• If P(y = 1|X) ≥ 0.5, predict the class as 1 (diabetic).
• Otherwise, predict the class as 0 (non-diabetic).

B. Support Vector Machine (SVM)
The SVM algorithm aims to find the optimal hyperplane that maximizes the margin between the two classes. The linear decision boundary in SVM can be represented as:

w · X + b = 0

Where:
• w is the weight vector perpendicular to the decision boundary (hyperplane).
• b is the bias term (offset from the origin).
• X is the input feature vector.
The goal of SVM is to find w and b such that the margin between the classes is maximized. The margin is defined as the distance between the closest points of the two classes (support vectors) and the hyperplane. The optimization problem is formulated as:

maximize 1 / ||w||

Subject to:

yi (w · Xi + b) ≥ 1, for all training samples i = 1, . . . , N

Where:
• yi ∈ {−1, 1} is the class label for the i-th training sample (i.e., diabetic or non-diabetic).
• Xi is the feature vector for the i-th sample.
• w and b are the parameters to be learned.
This is a quadratic optimization problem that can be solved using methods like quadratic programming.

C. Model Building
The following steps were taken to build and evaluate the models:
Data Splitting: The dataset D = {(X1, y1), (X2, y2), . . . , (XN, yN)} is split into training and testing sets:

D_train, D_test = train_test_split(D, test_size = 0.30)

Where:
• D_train is the training set (used to train both models).
• D_test is the testing set (used to evaluate both models).
Model Training (Logistic Regression and SVM):
• Logistic Regression: The model learns the coefficients β0, β1, . . . , βn by minimizing the logistic loss function:

L = −(1/N) Σ_{i=1}^{N} [ yi log(P(yi|Xi)) + (1 − yi) log(1 − P(yi|Xi)) ]

• Support Vector Machine (SVM): The model learns the weight vector w and bias b by solving the following optimization problem:

min_{w,b} (1/2) ||w||²
Subject to the constraints for all samples:

yi (w · Xi + b) ≥ 1

Model Evaluation: The performance of the trained models is evaluated using metrics like accuracy, precision, recall, and F1-score. Let y_true be the true labels and y_pred be the predicted labels for the test set. The metrics are calculated as follows:
• Accuracy:

Accuracy = Number of correct predictions / Total predictions

• Precision:

Precision = True Positives / (True Positives + False Positives)

• Recall:

Recall = True Positives / (True Positives + False Negatives)

• F1-Score:

F1-score = 2 × (Precision × Recall) / (Precision + Recall)

• Confusion Matrix:

Confusion Matrix = [ TP  FP ; FN  TN ]

Where:
– TP = True Positives (correctly predicted diabetics)
– FP = False Positives (incorrectly predicted diabetics)
– FN = False Negatives (incorrectly predicted non-diabetics)
– TN = True Negatives (correctly predicted non-diabetics)

Classification Report: The classification report for the Logistic Regression model includes key metrics such as precision, recall, and F1-score for both classes (diabetic and non-diabetic).

D. SVM Results
Confusion Matrix Heatmap: The confusion matrix for the Support Vector Machine (SVM) model is shown in the heatmap below:

Fig. 3. Confusion Matrix for SVM

Classification Report: The classification report for the SVM model is as follows:

Fig. 4. Accuracy of SVM

Accuracy Score: The accuracy score of the SVM model is:

Accuracy (SVM) = 0.75
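As a quick sketch, the metrics defined above can be computed with scikit-learn. The labels below are illustrative stand-ins, not the project's actual test split:

```python
# Illustrative computation of the evaluation metrics defined above.
# y_true / y_pred are stand-in labels, not the project's actual test split.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1-score: ", f1_score(y_true, y_pred))
# Note: scikit-learn orders the confusion matrix as [[TN, FP], [FN, TP]],
# which differs from the [[TP, FP], [FN, TN]] layout written above.
print(confusion_matrix(y_true, y_pred))
```

The layout note matters in practice: reading scikit-learn's matrix with the convention used in this report would swap TP and TN.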

Fig. 2. Accuracy of Logistic Regression

Accuracy Score: The accuracy score of the Logistic Regression model is calculated as:

Accuracy (Logistic Regression) = 0.74

E. Comparison of Models
The table below summarizes the key evaluation metrics (accuracy, precision, recall, and F1-score) for both the Logistic Regression and SVM models:

Metric               Logistic Regression   SVM
Accuracy             0.74                  0.75
Precision (Class 0)  0.80                  0.80
Precision (Class 1)  0.62                  0.64
Recall (Class 0)     0.79                  0.81
Recall (Class 1)     0.62                  0.62
F1-Score (Class 0)   0.80                  0.81
F1-Score (Class 1)   0.62                  0.63

TABLE I: Comparison of Model Metrics: Logistic Regression vs SVM

Insights from the Comparison:
- The SVM model slightly outperforms the Logistic Regression model in terms of accuracy (75% vs. 74%).
- SVM has slightly better recall for class 0 (non-diabetic), at 0.81 compared to 0.79 for Logistic Regression, while recall for class 1 (diabetic) is identical for both models (0.62).
- Both models exhibit comparable precision for class 0 (0.80 for both), while SVM has a slight advantage in precision for class 1 (0.64 vs. 0.62).
- The F1-scores for both models are similar, with SVM having a slight edge for both classes (0.81 for class 0 and 0.63 for class 1, compared to Logistic Regression's 0.80 and 0.62).
Based on these results, while both models perform similarly, SVM slightly outperforms Logistic Regression in accuracy and in precision for class 1 (diabetic). This suggests that SVM might be marginally more suitable for distinguishing between diabetic and non-diabetic cases in this dataset.

IV. DISCUSSION

1) Observations:
• Which model performed better and why? The SVM model performed slightly better than Logistic Regression in terms of accuracy and precision, particularly for class 1 (diabetic). This suggests that SVM might be more effective in distinguishing between diabetic and non-diabetic cases. Its ability to handle high-dimensional data and the margin maximization principle likely contributed to this slight edge.
• Impact of data scaling on SVM performance: Data scaling played a significant role in improving the performance of the SVM model. SVM is sensitive to the scale of the features, so scaling the data ensures that all features contribute equally to the model. Without scaling, the model might give more importance to features with larger values, leading to suboptimal performance.
• Challenges faced during analysis: One of the main challenges was ensuring that the data was appropriately preprocessed, especially checking for missing values and scaling the data correctly. Additionally, tuning the hyperparameters of the models to achieve optimal performance required careful experimentation and testing.

2) Insights:
• Key factors contributing to diabetes: The dataset includes various clinical features, such as glucose levels, blood pressure, and BMI, which are strongly correlated with the likelihood of developing diabetes. Feature importance analysis (e.g., using tree-based models or logistic regression coefficients) could help identify which features have the most influence on diabetes risk. For instance, higher glucose levels and BMI are typically associated with an increased likelihood of diabetes.
• Relevance of findings in healthcare: Early detection and prediction of diabetes are crucial in healthcare, as timely interventions can prevent or delay the onset of severe complications. The findings from this analysis, particularly the performance of the predictive models, highlight the importance of using machine learning models in clinical settings. These models can assist healthcare providers in identifying at-risk patients and targeting preventive measures more effectively. The ability to predict diabetes based on clinical data can lead to more personalized and proactive healthcare management.

V. CONCLUSION

In this analysis, we compared the effectiveness of Logistic Regression and Support Vector Machine (SVM) models in predicting diabetes based on clinical data.
• Both models performed similarly, with SVM showing a slight edge in accuracy and precision, particularly for classifying diabetic patients. This indicates that SVM may be more suitable for distinguishing between diabetic and non-diabetic cases, especially when dealing with clinical datasets like this one.
• Data scaling significantly improved the performance of the SVM model, highlighting the importance of proper preprocessing steps when working with machine learning algorithms.
Recommendations for healthcare professionals:
• The findings suggest that machine learning models like SVM can be valuable tools for predicting diabetes early based on clinical indicators. Healthcare professionals could use such models to identify patients at risk and provide timely interventions to prevent the onset of diabetes and related complications.
• It is recommended to use data preprocessing techniques like scaling and feature selection to enhance model accuracy and effectiveness.
• In clinical practice, these models can be integrated into decision support systems to assist healthcare providers in making informed decisions about patient care and diabetes prevention.

VI. REFERENCES

Dataset Source: Predict Diabetes Database. Available at: https://www.kaggle.com/datasets/whenamancodes/predict-diabities
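As a closing illustration, the end-to-end workflow described in Sections II and III (scaling, a 70/30 split, and both classifiers) can be sketched with scikit-learn. This is a minimal sketch rather than the authors' exact code; a synthetic two-class dataset stands in for diabetes.csv so it is self-contained, and it fits the scaler on the training portion only to avoid leakage:

```python
# Minimal end-to-end sketch: scale features, split 70/30, train Logistic
# Regression and a linear SVM, and report test accuracy.
# A synthetic two-class dataset stands in for diabetes.csv here.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=768, n_features=8, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

# Feature scaling: fit on the training data only, then apply to both splits.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

for name, model in [("Logistic Regression", LogisticRegression()),
                    ("SVM (linear)", SVC(kernel="linear"))]:
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: accuracy = {acc:.2f}")
```

With the real diabetes.csv the same steps apply, though exact scores depend on the data and the random split.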
