DIABETES PREDICTION USING MACHINE LEARNING ALGORITHMS
A PROJECT REPORT
Submitted by
Aryan Kumar [Reg. No. RA2011003010535]
Adarsh [Reg. No. RA2011003010522]
BONAFIDE CERTIFICATE
Certified that the 18CSP109L B.Tech project report titled “DIABETES PREDICTION USING
MACHINE LEARNING ALGORITHM” is the bonafide work of Mr. Aryan Kumar
[Reg. No. RA2011003010535] and Mr. Adarsh [Reg. No. RA2011003010522], who carried out the
project work under my supervision. Certified further that, to the best of my knowledge, the work
reported herein does not form part of any other thesis or dissertation on the basis of which a degree
or award was conferred on an earlier occasion for this or any other candidate.
DR. M. PUSHPALATHA
HEAD OF THE DEPARTMENT
Department of Computing Technologies
DECLARATION

I hereby certify that this assessment complies with the University’s Rules and Regulations relating
to academic misconduct and plagiarism, as listed on the University website, in the Regulations, and
in the Education Committee guidelines.
I confirm that all the work contained in this assessment is our own except where indicated, and that
we have met the following conditions:
• Clearly referenced / listed all sources as appropriate
• Referenced and put in inverted commas all quoted text (from books, the web, etc.)
• Not made any use of the report(s) or essay(s) of any other student(s), either past or present
• Acknowledged in appropriate places any help that we have received from others (e.g.
fellow students, technicians, statisticians, external sources)
• Complied with any other plagiarism criteria specified in the course handbook /
University website
I understand that any false claim for this work will be penalized in accordance with the University
policies and regulations.
ACKNOWLEDGEMENT
We extend our sincere thanks to Dean - CET, SRM Institute of Science and Technology, Dr. T. V.
Gopal, for his invaluable support.
We wish to thank Dr. Revathi Venkataraman, Professor and Chairperson, School of Computing,
SRM Institute of Science and Technology, for her support throughout the project work.
Our inexpressible respect and thanks to our guide, Dr S. Gnanavel, Assistant Professor,
Department of Computing Technologies, SRM Institute of Science and Technology, for providing us
with an opportunity to pursue our project under his mentorship. He provided us with the freedom
and support to explore the research topics of our interest. His passion for solving problems and
making a difference in the world has always been inspiring.
We want to convey our thanks to our panel head, Dr. S. Gnanavel, Associate Professor, Department
of Computing Technologies, and panel members M. Revathi, Assistant Professor, R. Thilagavathy,
Assistant Professor, and Dr. M. Suganiya, Assistant Professor, Department of Computing
Technologies, SRM Institute of Science and Technology, for their input during the project reviews
and support.
project reviews and support.
We register our immeasurable thanks to our Faculty Advisor, Mrs Brindha R, Assistant Professor,
Department of Computing Technologies, SRM Institute of Science and Technology for leading and
helping us to complete our course.
We sincerely thank all the staff and students of the Computing Technologies Department, School of
Computing, SRM Institute of Science and Technology, for their help during our project. Finally,
we would like to thank our parents, family members, and friends for their unconditional love,
constant support, and encouragement.
ABSTRACT
Diabetes mellitus is a significant non-communicable chronic metabolic illness that poses a serious
health risk to humans. Early diabetes detection is essential for prompt intervention and efficient
care, which lowers the risk of complications. Estimating diabetes at an early stage is challenging
because the majority of existing approaches rely on a single prediction model. Focusing on the
predictive results of machine learning models, this study proposes a combination of estimating
models for early diabetes prediction and conducts practical research on the model's efficacy.
According to the findings, the combined prediction model outperforms the single early diabetes
prediction model in terms of accuracy and predictive impact.
TABLE OF CONTENTS
ABSTRACT VI
LIST OF FIGURES IX
LIST OF TABLES X
ABBREVIATIONS XI
1 INTRODUCTION 1
1.1 Overview 1
1.2 Approach 2
1.3 General Steps Involved 3
2 LITERATURE SURVEY 6
2.1 Literature Review 6
3 SYSTEM ARCHITECTURE AND DESIGN 12
4 METHODOLOGY 15
6 CONCLUSION AND FUTURE ENHANCEMENT 30
6.1 Conclusion 38
6.2 Future Enhancement 39
REFERENCES 42
APPENDIX 1 44
APPENDIX 2 61
PLAGIARISM REPORT 63
PAPER PUBLICATION 64
LIST OF FIGURES
4.1 Histogram of numerical data
4.2 Distribution of label-encoded categorical variables

LIST OF TABLES

LIST OF SYMBOLS AND ABBREVIATIONS
AI Artificial Intelligence
ML Machine Learning
DL Deep Learning
CHAPTER 1
INTRODUCTION
1.1 Overview
The term "diabetes mellitus" in this project refers to a group of metabolic diseases characterized by
high blood sugar levels, caused either by insufficient insulin synthesis or by inadequate insulin
responsiveness of the body's cells. Insulin is the hormone that controls blood glucose levels. The
result of this ongoing disease is blood that circulates with too much sugar. Diabetes is one of the
non-communicable diseases that endanger people's health. It is a chronic disease in which the body
either produces insufficient insulin or is unable to use the insulin that is produced. Diabetes should
not be disregarded because, if ignored, it can result in a number of major health problems, such as
heart conditions, renal disease, high blood pressure, eye damage, and organ failure.
Machine learning presents a powerful tool for predicting diabetes and reducing its impact. Various
machine learning algorithms, including KNN, random forests, decision trees, and logistic
regression, can be trained on historical data to identify patient trends and predict the onset of the
disease. The dataset includes an individual's dietary preferences, medical history, and patterns of
physical activity. The most pertinent variables are found using feature selection approaches, which
improves the prediction model's precision and effectiveness. If diabetes is discovered earlier, it can
be treated. Different machine learning techniques are applied and evaluated for their efficacy in
predicting diabetes. To accomplish this goal, we will use a variety of methodologies to more
accurately forecast the onset of diabetes in patients. Here, we will investigate a group of models,
including the AdaBoost classifier, the Naive Bayes classifier, and the Random Forest classifier.
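To make the idea concrete, here is a minimal, hedged sketch of training these three classifiers with scikit-learn. The file name diabetes.csv is hypothetical; the schema is assumed to follow the features listed in Section 1.3, with a binary Outcome label.

```python
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hypothetical file; assumed Pima-style schema with a binary "Outcome" column.
df = pd.read_csv("diabetes.csv")
X, y = df.drop(columns=["Outcome"]), df["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

models = {
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)  # train on historical patient data
    print(name, round(accuracy_score(y_test, model.predict(X_test)), 3))
```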
1.2 Approach
Predicting diabetes at an early stage is crucial for timely intervention and management. Several
approaches can be employed for early diabetes prediction.
• Patient History: Gather comprehensive patient history, including family medical history,
lifestyle factors (diet, exercise), and any previous instances of elevated blood sugar levels.
• Feature Selection: Identify relevant features using techniques like correlation analysis to
select the most important predictors (a short sketch follows this list).
• Machine Learning Algorithms: Implement machine learning models like Logistic
Regression, Decision Trees, Random Forest, or even advanced techniques like Neural
Networks for prediction.
• Biometric Data: Collect data like BMI (Body Mass Index), waist circumference, and blood
pressure.
• Biological Markers: Monitor glucose levels, insulin resistance, and other relevant
biomarkers.
• Community Health Programs: Implement community-based programs to raise awareness
about diabetes prevention, encourage regular check-ups, and promote a healthy lifestyle.
• Continuous Monitoring: For individuals at high risk, establish a system for continuous
monitoring and follow-up care.
• Data Privacy: Ensure that patient data is anonymized and privacy regulations are strictly
adhered to.
• Informed Consent: Obtain informed consent from individuals participating in research
studies or data collection efforts.
• Collaboration with Healthcare Providers: Collaborate with healthcare providers to offer
screenings and early detection camps in communities.
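As referenced in the Feature Selection bullet above, the following minimal sketch keeps features whose absolute correlation with the label exceeds an illustrative threshold; the file name and the 0.2 cut-off are assumptions.

```python
import pandas as pd

df = pd.read_csv("diabetes.csv")  # hypothetical file name

# Absolute Pearson correlation of each predictor with the binary Outcome label.
corr = df.corr(numeric_only=True)["Outcome"].drop("Outcome").abs()

# Keep predictors above an illustrative cut-off of 0.2.
selected = corr[corr > 0.2].index.tolist()
print("Selected predictors:", selected)
```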
1.3 General Steps Involved in Diabetes Prediction
1. Data Collection Process: The initial step involves the meticulous gathering of patients' data,
encompassing various facets such as demographics, health history, and patient interactions. For
our illustration, we have employed a dataset from Kaggle containing patient body measurements.
This data collection process is systematic, revolving around defining a research question or
hypothesis, selecting a suitable sample population, and determining the appropriate data collection
methods and tools.
2. Data Preprocessing Steps: Following data collection, the data is subjected to thorough
preprocessing. This entails addressing missing values, handling outliers, and rectifying
inconsistencies. The objective is to transform and normalize the data, making it conducive for
utilization in machine learning algorithms. Data preprocessing is a comprehensive procedure
involving several facets:
• Data Integration: Merging information from diverse datasets into a cohesive whole.
3. Feature Engineering Endeavors: This phase focuses on extracting pertinent features from the
data that are likely to indicate diabetes risk. These features may encompass variables such as the
patient's Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes Pedigree
Function, Age, and Outcome. Feature engineering is the process of shaping raw data into valuable
features for machine learning models. This process involves several key aspects:
• Feature Selection: Identifying the most pertinent features from a broader set, often through
statistical analysis or assessing attribute significance.
• Feature Extraction: Generating new features from existing ones, employing techniques like
principal component analysis or domain-specific knowledge.
• Feature Scaling: Ensuring that feature values are standardized to a comparable range, vital
for certain machine learning models (a sketch follows this list).
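A minimal scaling sketch for the Feature Scaling step above, assuming the same hypothetical diabetes.csv; StandardScaler is one reasonable choice among several (e.g. MinMaxScaler).

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("diabetes.csv")  # hypothetical file name
X = df.drop(columns=["Outcome"])

# Standardize each feature to zero mean and unit variance so large-scale
# columns (e.g. Insulin) do not dominate distance-based models.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0).round(2), X_scaled.std(axis=0).round(2))
```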
4. Model Selection Considerations: The critical decision of choosing a suitable machine learning
model for the specific problem at hand is pivotal. Common models used for diabetes prediction
include logistic regression, decision trees, random forests, and support vector machines. Model
selection is a crucial step, as it entails picking the optimal model from a range of candidates, all
trained on the same dataset. The aim is to identify the model capable of generalizing effectively to
new data, yielding accurate predictions. Model selection techniques encompass various methods
such as cross-validation, holdout validation, and bootstrapping. The choice of model significantly
influences predictive performance, emphasizing the importance of careful selection.
5. Model Training: The selected model is trained using the preprocessed data. Model training plays
a pivotal role in machine learning, involving the process of teaching the model to make accurate
predictions. It entails iteratively adjusting the model's parameters, enabling it to recognize patterns
in input data and generate desired outputs. The training process employs a training dataset
containing input features and corresponding target labels. The model optimizes its parameters based
on this data, facilitating accurate predictions on unseen data. Training success relies on factors
including data quality, algorithm choice, optimization techniques, and hyperparameters. Model
performance is assessed using validation methods such as cross-validation, ensuring reliable
predictions; a sketch follows.
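A minimal sketch of the cross-validation mentioned above, under the same dataset assumption; the 5-fold setting and the choice of a random forest are illustrative.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("diabetes.csv")  # hypothetical file name
X, y = df.drop(columns=["Outcome"]), df["Outcome"]

# 5-fold cross-validation: each fold serves once as the validation set.
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5)
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```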
6. Model Evaluation Procedures: The performance of the trained model is assessed using suitable
evaluation metrics such as accuracy, precision, recall, and F1 score. Model evaluation is a pivotal
facet of machine learning, involving the scrutiny of a trained model's performance on new, unseen
data. The objective is to verify the model's ability to generalize effectively and produce accurate
predictions. Evaluation metrics like accuracy, precision, recall, and F1 score are applied, depending
on the application's nature. Model evaluation is an iterative process, often necessitating adjustments
to hyperparameters and data preprocessing to enhance performance. Additionally, it aids in model
comparison and the selection of the most suitable model for a given application.
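The sketch below computes the metrics this step names with scikit-learn, under the same hypothetical dataset; logistic regression merely stands in for whichever model was selected.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

df = pd.read_csv("diabetes.csv")  # hypothetical file name
X, y = df.drop(columns=["Outcome"]), df["Outcome"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = model.predict(X_te)
print(confusion_matrix(y_te, y_pred))       # rows: actual, columns: predicted
print(classification_report(y_te, y_pred))  # precision, recall, F1, accuracy
```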
CHAPTER 2
LITERATURE SURVEY
Currently, both traditional statistics-based forecasting and predictions utilizing integrated classifiers
are employed in algorithms for predicting patient outcomes among domestic and global users. These
methods combine machine learning techniques with statistical theory and utilize consumer visual
insights to establish relationships between various indicators. For instance, MGUIIS and Co.
developed a predictive model based on logistic regression, focusing on the average time patients
spend per day. Experimental results using a real dataset, after identifying and replacing null values,
show that the proposed technique has higher accuracy after imputation of missing values. They
conducted a comparison study to validate the effectiveness of their new technique in predicting
patient behaviour before and after optimization. The authors of this study conducted patient health
analysis using a logistic regression model, training and evaluating it with factors such as
Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes Pedigree Function,
Age, and Outcome. In the initial testing phase, the model displayed a 74% accuracy rate, which
later increased to 79%. Furthermore, combining the two distinct datasets mentioned above
significantly enhanced the model's accuracy. However, it is worth noting that the diabetes prediction
model overlooked some critical factors influencing subscriber decision-making processes, such as
recent package utilization and satisfaction with patient support. Thus, it may not serve as a
comprehensive tool for identifying the causes of patient turnover. Nonetheless, this research carries
significant value. The Improved Diabetes Prediction Method comprises three key steps: quantifying
tie strength, utilizing machine learning techniques to amalgamate traditional and social variables,
and employing an influence propagation model. For strategic planners, a pattern analysis framework
is recommended to offer guidance. The chat-graph approach to diabetes prediction focuses on
forecasting based on conversation activity. However, this approach does not consider the social
elements derived from graph theory. Users are grouped into categories for diabetes prediction based
on their online actions using a clustering method, which then applies rules to prevent attrition. In
contrast, diabetes prediction by exploratory data mining relies on clustering, a common technique
for statistical data analysis used in many fields, including machine learning, pattern recognition,
image analysis, information retrieval, bioinformatics, data compression, and computer graphics.
1. Sneha, N. and Gangil, T., "Analysis of diabetes mellitus for early prediction using optimal
features selection," Journal of Big Data, 6(1), p. 13 (2019). [1]
In their paper "Analysis of diabetes mellitus for early prediction using optimal feature selection,"
Sneha and Gangil delve into the crucial realm of diabetes detection by employing advanced
predictive analysis techniques. Their study focuses on the selection of pertinent attributes crucial for
early identification of diabetes mellitus, leveraging the potential of machine learning algorithms.
Drawing upon data sourced from the UCI machine learning repository, the authors meticulously analyze 15
attributes for classification purposes. Their investigation harnesses the prowess of several prominent
classifiers including Support Vector Machine, Random Forest, and Naïve Bayes. Through rigorous
experimentation, they achieve notable accuracies of 77.73%, 75.39%, and 73.48% respectively for
each classifier. By employing a diverse range of machine learning methodologies, Sneha and
Gangil contribute to the burgeoning field of predictive healthcare analytics, offering insights into
the potential of algorithmic approaches in early diabetes detection. Their findings not only
underscore the significance of optimal feature selection in predictive modeling but also highlight
the efficacy of machine learning tools in healthcare applications. This research serves as a pivotal
step towards developing more efficient and accurate diagnostic tools for diabetes mellitus,
ultimately aiming to enhance early intervention strategies and improve patient outcomes.
Additionally, the study underscores the growing relevance of interdisciplinary collaborations
between healthcare professionals and data scientists in addressing complex medical challenges
through innovative computational approaches.
2. B. K. VijayaKumar, B. Lavanya, I. Nirmala and S. Sofia Caroline, "Random Forest Algorithm
for the Prediction of Diabetes". [2]

In their paper titled "Random Forest Algorithm for the Prediction of Diabetes," B. K. VijayaKumar,
B. Lavanya, I. Nirmala, and S. Sofia Caroline introduce a novel approach to diabetes prediction
aimed at enhancing the accuracy and efficiency of early detection systems. Their proposed method
leverages the random forest algorithm, a powerful ensemble learning technique known for its
effectiveness in classification tasks. The primary objective of their research is to develop a
predictive model capable of identifying individuals at risk of diabetes with high precision. Through
extensive experimentation and analysis, the authors demonstrate the superiority of their approach in
comparison to existing methods. The results obtained from their study underscore the effectiveness
of the proposed model in predicting diabetes onset, showcasing its potential to significantly improve
healthcare outcomes. By harnessing the capabilities of the random forest algorithm, the authors not
only enhance prediction accuracy but also streamline the process, enabling instantaneous
assessment of diabetes risk for patients. This innovation holds significant promise for proactive
healthcare interventions, allowing healthcare providers to intervene early and implement preventive
measures effectively. Furthermore, the study contributes to the existing body of research in diabetes
prediction by offering a robust and efficient solution that outperforms traditional approaches. The
authors' rigorous evaluation of their model against established benchmarks highlights its superiority
and underscores its potential for widespread adoption in clinical settings. In a similar vein, Nonso
Nnamoko et al. present their findings on diabetes prediction, employing a group-supervised learning
approach. Utilizing five widely recognized classifiers for groups along with a Meta classifier, their
research also aims to enhance prediction accuracy. By comparing their results with existing studies
utilizing similar datasets, they showcase the efficacy of their method in accurately predicting the
onset of diabetes. The comparative analysis conducted by Nnamoko et al. further validates the
importance of leveraging advanced machine learning techniques for diabetes prediction.
Collectively, these studies contribute to the advancement of predictive healthcare analytics, offering
valuable insights and methodologies for early disease detection and intervention. The integration of
sophisticated algorithms such as random forest and group-supervised learning demonstrates the
potential of machine learning in revolutionizing healthcare delivery. Moving forward, continued
research and innovation in this domain hold the promise of further improving predictive models,
ultimately leading to more effective disease management and improved patient outcomes.
3. Tejas N. Joshi and Pramila M. Chawan, "Diabetes Prediction Using Machine Learning
Techniques," International Journal of Engineering Research and Application, Vol. 8, Issue 1
(Part II), January 2018, pp. 09-13. [3]
In their paper titled "Diabetes Prediction Using Machine Learning Techniques," Tejas N. Joshi
and Prof. Pramila M. Chawan delve into the realm of diabetes prediction employing advanced
machine learning methods. Their study, published in the International Journal of Engineering
Research and Application in January 2018, focuses on the development of an effective technique
for the early detection of diabetes. To achieve this, the researchers explore three distinct supervised
machine learning approaches: Support Vector Machines (SVM), logistic regression, and Artificial
Neural Networks (ANN). By leveraging these methodologies, they aim to enhance the accuracy and
efficiency of diabetes prediction. Moreover, Deeraj Shetty and his colleagues contribute to this
domain with their work on "Intelligent Diabetes Disease Prediction System," which utilizes data
mining techniques. Their system incorporates algorithms such as Bayesian and K-Nearest Neighbor
(KNN) to analyze diabetes patient data and predict the onset of the disease. By amalgamating
various attributes of diabetes patients' diagnoses, Shetty et al. strive to provide a comprehensive
analysis that aids in early disease detection and management. Through the integration of machine
learning and data mining techniques, both studies underscore the importance of leveraging
advanced computational methods in healthcare for proactive disease management and prevention.
This research signifies a significant stride towards harnessing the power of artificial intelligence and
data analytics to address pressing medical challenges, ultimately contributing to improved patient
outcomes and healthcare delivery.
4. Sisodia, D. and Sisodia, D. S., "Prediction of diabetes using classification algorithms,"
Procedia Computer Science, 132, pp. 1578-1585 (2018). [4]
In their 2018 study, "Prediction of diabetes using classification algorithms," D. Sisodia and D.
Sisodia delve into the development of a support system aimed at predicting diseases, particularly
diabetes, utilizing the Pima Indians Diabetes Database (PIDD). Through the utilization of
three distinct machine learning recognition algorithms, namely Bayes Naive, Support Vector
Machine (SVM), and Decision Tree, the authors sought to diagnose diabetes at its earlier stages,
achieving notable accuracies of 76.3%, 65.1%, and 73.82%, respectively. Their approach, grounded
in a case study, was compared against other established methods such as decision trees and neural
networks, revealing its superior performance in terms of both classification accuracy and feature
selection. The study not only underscores the significance of their proposed method but also its
practical implications across various domains, including patient retention, marketing strategies, and
patient relationship management within the telecommunications industry. By emphasizing the
efficacy of stratified sampling and model combination techniques, the authors advocate for a more
nuanced approach to enhancing the accuracy of diabetes prediction models. Consequently, this
research represents a significant contribution to the field of patient diabetes prediction, shedding
light on the pivotal role played by advanced machine learning algorithms and highlighting avenues
for further refinement and application in real-world scenarios. Through their rigorous methodology
and comprehensive analysis, D. Sisodia and D. Sisodia underscore the potential for data-driven
approaches to revolutionize healthcare delivery, paving the way for more personalized and effective
interventions aimed at combating chronic diseases such as diabetes.
5. Rahul Joshi and Minyechil Alehegn, "Analysis and prediction of diabetes diseases using
machine learning algorithm: Ensemble approach," International Research Journal of
Engineering and Technology, Volume 04, Issue 10, Oct 2017. [5]
In their paper titled "Analysis and prediction of diabetes diseases using machine learning algorithm:
Ensemble approach," Rahul Joshi and Minyechil Alehegn delve into the realm of predictive
analytics in healthcare, particularly focusing on the early diagnosis of diabetes. Employing machine
learning (ML) techniques, the authors aim to utilize data-driven approaches to forecast diabetic
conditions in patients, thereby potentially saving lives through timely intervention. The study
leverages renowned ML algorithms such as K-Nearest Neighbors (KNN) and Naïve Bayes to make
educated guesses on the dataset in its initial phases. The results are promising, showcasing a high
accuracy rate of 90.36% in their proposed method, outperforming Decision Stump which trailed
behind at 83.72%. Notably, the ensemble approach, integrating multiple algorithms like Random
Forest, Naïve Bayes, KNN, and J48, proves to be superior in accuracy compared to individual
algorithms. The authors highlight the efficacy of the decision tree algorithm, particularly in its
ability to deliver highly accurate results across various tests. To facilitate their research, the authors
utilize Java and Weka as tools in this hybrid study, harnessing their capabilities for predicting
diabetes data. Central to their approach is the utilization of an ensemble hybrid model, wherein the
combined strength of KNN, Naive Bayes, Random Forest, and J48 algorithms is harnessed to
enhance performance and accuracy. Among these, J48 stands out as a popular choice, exhibiting
commendable accuracy rates. Interestingly, the study reveals that Random Forest surpasses J48 and
Naive Bayes in accuracy when subjected to 10-fold cross-validation splitting, further
solidifying its efficacy in diabetic prediction models. Moreover, to mitigate the risk of erroneous
treatments, the authors develop a fuzzy rule, adding a layer of sophistication to their predictive
model. Overall, the research underscores the significance of ML-driven approaches in healthcare,
particularly in the domain of diabetic prediction, showcasing how ensemble methods can enhance
accuracy and potentially revolutionize early diagnosis and treatment strategies, ultimately
improving patient outcomes and quality of life.
6. L. J. Muhammad, Ebrahem A. Algehyne and Sani Sharif Usman, "Predictive Supervised
Machine Learning Models for Diabetes Mellitus," published by Springer. [6]
In this study, the diagnostic dataset of type 2 DM was collected from the Murtala Mohammed
Specialist Hospital, Kano, and used to develop predictive supervised machine learning models
based on logistic regression, support vector machine, K-nearest neighbor, random forest, naive
Bayes, and gradient boosting algorithms.
The random forest predictive learning-based model appeared to be one of the best developed
models with 88.76% accuracy; however, in terms of the receiver operating characteristic curve,
the random forest and gradient boosting predictive learning-based models were found to be the
best predictive learning models, each with 86.28% predictive ability.
CHAPTER 3
SYSTEM ARCHITECTURE AND DESIGN
In order to create, train, and implement a diabetes prediction model, the system architecture for
early diabetes prediction using machine learning usually includes a number of components. Data
collection, the first component, entails gathering and preparing patient data from a variety of
sources, including health records, patient reviews, and demographic data. The data is then cleansed,
transformed, and made ready for modelling. The second component is feature engineering, which
comprises selecting and adjusting relevant data characteristics to improve the accuracy of the
prediction model. Methods like dimensionality reduction, feature scaling, and feature selection may
be applied for this.
3.2 Use Case Diagram
To efficiently detect and anticipate diabetes among patients, the diabetes prediction system
architecture consists of several components and phases.
The data collection phase, which forms the basis of the architecture, involves gathering pertinent
data from a variety of sources, including health history, patient feedback, demographic data, and
patient interactions. The data is then kept in a central data repository, such as a data warehouse or
big data platform, for later processing and analysis. Data preparation is the following step, where
the gathered data is cleaned, transformed, and feature engineered. This stage guarantees that the
format of the data is appropriate for modelling and analysis. Missing value handling, data
normalisation, and feature creation based on domain expertise are a few examples of activities that
could be included.
The model creation step of the design comes after data preparation. At this point, statistical or
machine learning methods are used to develop prediction models. In order to find trends and
pinpoint the main indicators of diabetes, these models are trained on historical patient data, which
includes both non-diabetic and diabetic patients. Various techniques, based on the available data
and the complexity of the task, can be used, including logistic regression, decision trees, and
neural networks.
Lastly, a feedback loop is incorporated into the system design to help the diabetes prediction model
improve over time. The model may be frequently retrained and modified to adjust to shifting patient
behaviour by gathering input on the forecast accuracy and tracking the actual diabetes outcomes.
The fourth component is training the model, which comprises using techniques such as cross-
validation and hyperparameter tuning to maximize the model's performance by training the
selected algorithm on the prepared data. As the final component, the trained model's performance is
evaluated on a holdout set of data, ensuring that it generalises well to new data. Model deployment,
the last phase, involves applying the learned model to new data and using it to generate predictions.
To do this, the model may be released as a REST API or microservice that can be integrated into
existing applications. Overall, there are several elements in the system architecture for diabetes
prediction using machine learning that call for proficiency in machine learning algorithms, feature
engineering, data preprocessing, and deployment infrastructure. For the system to manage massive
data volumes and keep producing precise predictions over time, it must also be scalable,
dependable, and maintainable.
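As a sketch of the deployment phase described above, a trained model could be exposed as a small REST endpoint. Flask and the joblib file name are illustrative assumptions, not the report's actual deployment stack; any microservice framework would serve equally.

```python
import joblib
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("diabetes_model.joblib")  # hypothetical serialized model

FEATURES = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
            "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()  # expects one JSON value per feature
    row = pd.DataFrame([[payload[f] for f in FEATURES]], columns=FEATURES)
    return jsonify({"diabetes": int(model.predict(row)[0])})

if __name__ == "__main__":
    app.run()
```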
CHAPTER 4
METHODOLOGY
4.1 Existing System

A rudimentary machine learning model with a predefined set of characteristics, trained on sparse
and antiquated data, may already be in use at certain organisations. Lower accuracy may result
from this approach's failure to take dynamic shifts in patient behaviour and preferences into
account.
It is possible that others have put in place more sophisticated machine learning algorithms that
make use of a wide range of patient data, such as demographics, past health reports, activity,
blood samples, and sentiment analysis on social media. These models might generate precise and
dynamic diabetes predictions by utilising methods like neural networks and decision trees.
Predictive modelling, data analysis, and data gathering are usually combined in current diabetes
prediction systems. Companies collect pertinent data from a variety of sources, including past
health records, demographic data, and contacts with patients. Databases and data warehouses are
used to organise and store this data.
Data preparation is the process of transforming and cleaning the obtained data to make sure it is
suitable for analysis and of high quality. In order to produce a consistent and trustworthy dataset,
this stage entails addressing missing values, eliminating outliers, and normalising data. Businesses
use predictive modelling approaches to create diabetes prediction models once the data is prepared.
These models find trends, correlations, and variables that indicate diabetes risk using machine
learning algorithms or statistical techniques. To train the models using past patient data, several
algorithms including logistic regression, decision trees, random forests, or neural networks are
frequently employed on Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI,
Diabetes Pedigree Function, Age, and Outcome.
AUC, accuracy, precision, recall, and other measures are used to assess the diabetes prediction
models' performance. This assessment aids in determining how well the model predicts diabetes
onset. The model is incorporated into clinical systems or patient relationship management (CRM)
platforms for real-time diabetes prediction if it satisfies the required performance benchmarks.
To evaluate the effectiveness of the model in the current system, practitioners frequently track the
diabetes forecasts and contrast them with the actual diabetes results. Through retraining and
updating the diabetes prediction models based on the most recent data and patient behaviour, this
feedback loop enables continuous improvement of the models. In order to offer insights into patient
categories that are at high risk of diabetes, the efficacy of intervention strategies, and diabetes-
related patterns, the current system may additionally include capabilities like dashboards or
visualisation tools. These visualisations aid in understanding the dynamics of diabetes across
factors such as Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes
Pedigree Function, and Age, and inform decision-making.
4.2 Proposed System

A predictive diabetes model is a classification tool: a system that examines the traits of individuals
to determine which traits are essential in forecasting outcomes. Let's imagine we have a dataset with
information on Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes
Pedigree Function, Age, and Outcome of people [1]. These people's characteristics, including their
Glucose, Blood Pressure, Insulin, BMI, Age, and Outcome, among others, are described in the data.
Our model should predict each patient's outcome; hence, Outcome will be the target variable. The
data should be examined with an emphasis on how various aspects connect to the patient's diabetes
status [14].
We are prepared to construct many models in search of the optimum fit. Diabetes prediction is a
binary classification problem, since a patient either has diabetes or does not.
We'll test:
• Naive Bayes
Naive Bayes is a machine learning algorithm based on the Bayes theorem of probability. It is a
probabilistic algorithm that uses the conditional probability of features to classify data into different
categories. Naive Bayes is commonly used for text classification and spam filtering, but it can also
be used in other classification tasks such as sentiment analysis, recommendation systems, and
diabetes prediction. The algorithm works by calculating the probability of each feature
given a class label and then multiplying all these probabilities to get the probability of a data point
belonging to a particular class. The class with the highest probability is then assigned as the
prediction for the data point.
Naive Bayes is a probabilistic machine learning algorithm commonly used for classification tasks. It
is based on Bayes' theorem and assumes that the features are conditionally independent given the
class label. Despite its simplicity and naive assumption, Naive Bayes often performs remarkably
well and is widely used in various applications such as spam filtering, sentiment analysis, and
document categorization.
The algorithm is called "naive" because it assumes that the presence or absence of a particular
feature is independent of the presence or absence of any other feature, given the class label. This
assumption allows for simplified calculations and efficient training.
During the training phase, Naive Bayes calculates the probabilities of each feature given each class
label by counting occurrences in the training data. It estimates the prior probabilities of each class
label based on the frequency of their occurrences. These probabilities are then combined using
Bayes' theorem to calculate the posterior probability of each class label given the observed features.
During the prediction phase, Naive Bayes uses the calculated probabilities to determine the most
likely class label for a new instance. It calculates the posterior probabilities for each class label and
selects the label with the highest probability as the predicted class.
Naive Bayes has several advantages. It is computationally efficient and works well with large
datasets. It can handle high-dimensional feature spaces and is robust to irrelevant features, as the
independence assumption allows it to disregard irrelevant correlations. Naive Bayes is also less
prone to overfitting, especially when the training data is limited. Despite its simplicity, Naive Bayes
performs well in many real-world scenarios. However, the assumption of feature independence can
limit its effectiveness in cases where there are strong dependencies among the features. In such
cases, more sophisticated algorithms may be more appropriate. Additionally, Naive Bayes is
sensitive to the presence of rare or unseen feature combinations in the training data, which can
result in zero probabilities and affect the accuracy of predictions.
In summary, Naive Bayes is a simple yet effective probabilistic algorithm used for classification
tasks. Its efficiency, ability to handle high-dimensional data, and robustness to irrelevant features
make it a popular choice in various applications. However, its assumption of feature independence
may limit its performance in certain scenarios.
One of the strengths of Naive Bayes is that it requires a relatively small amount of training data to
estimate the parameters needed for classification. However, it can be sensitive to irrelevant or
correlated features, and its assumption of independence may not hold in some real-world
applications.
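A minimal Gaussian Naive Bayes sketch for this task, under the same hypothetical diabetes.csv; GaussianNB is assumed because the clinical features here are continuous.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

df = pd.read_csv("diabetes.csv")  # hypothetical file name
X, y = df.drop(columns=["Outcome"]), df["Outcome"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Fits class priors and one Gaussian per feature per class.
nb = GaussianNB().fit(X_tr, y_tr)
print(f"Test accuracy: {nb.score(X_te, y_te):.3f}")
print("Class priors:", nb.class_prior_.round(3))
```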
• Kernel SVM
Kernel Support Vector Machine (SVM) is a popular classification algorithm in machine learning
that can be used for both linear and non-linear data. It works by finding the hyperplane that
maximizes the margin between the two classes in the dataset. In kernel SVM, the data is
transformed into a higher dimensional space using a kernel function, such as a radial basis function
(RBF) or polynomial function, to make it easier to separate the classes. The transformed data is then
used to find the optimal hyperplane.
Kernel Support Vector Machines (SVM) is a powerful machine learning algorithm that has gained
popularity due to its ability to handle non-linearly separable data. SVMs are binary classifiers that
aim to find an optimal hyperplane to separate data points belonging to different classes. However,
in cases where the data is not linearly separable, the kernel trick comes into play.
Kernel SVM extends the capabilities of traditional SVMs by transforming the input data into a
higher-dimensional feature space, where it becomes linearly separable. The kernel function plays a
crucial role in this process by efficiently mapping the data points into the desired space. Common
kernel functions include linear, polynomial, radial basis function (RBF), and sigmoid. The kernel
trick allows the SVM algorithm to operate in the original input space, avoiding the need for explicit
computation in the higher-dimensional feature space. This makes kernel SVM computationally
efficient, even for complex data.
One of the key advantages of kernel SVM is its ability to capture intricate decision boundaries,
enabling it to handle non-linear relationships in the data. The RBF kernel, in particular, is widely
used and exhibits excellent performance across various domains. Kernel SVMs are robust against
overfitting as they focus on maximizing the margin between support vectors rather than attempting
to fit every training point precisely. Support vectors are the data points closest to the decision
boundary and are critical for determining the optimal hyperplane.
Despite its strengths, kernel SVMs have some considerations. Choosing an appropriate kernel
function and tuning its parameters can be challenging, requiring careful experimentation.
Additionally, kernel SVMs can be computationally demanding, especially with large datasets, as the
training complexity increases with the number of support vectors. In summary, kernel SVM is a
versatile algorithm that leverages the kernel trick to handle non-linear data effectively. Its ability to
capture complex decision boundaries makes it a valuable tool in various machine learning tasks,
including classification and regression. However, proper kernel selection and parameter tuning are
crucial for achieving optimal performance.
Kernel SVM is useful when the data is not linearly separable and there are complex decision
boundaries between the classes. It has been widely used in various fields, including image
classification, text classification, and bioinformatics.
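A minimal RBF-kernel SVM sketch under the same dataset assumption; features are standardized first because SVM margins are distance-based, and the C and gamma values shown are defaults, not tuned.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv("diabetes.csv")  # hypothetical file name
X, y = df.drop(columns=["Outcome"]), df["Outcome"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# The RBF kernel maps points into a higher-dimensional space implicitly.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_tr, y_tr)
print(f"Test accuracy: {svm.score(X_te, y_te):.3f}")
```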
• KNN
KNN, or k-nearest neighbors, is a classification algorithm that is based on the idea of finding the k
nearest data points in the feature space to the point being classified. The algorithm then assigns the
class that appears most frequently among the k nearest neighbors to the point being classified.
The k value in KNN is a hyperparameter that needs to be set before running the algorithm. A
smaller value of k will result in a more flexible decision boundary, which is more sensitive to noise
in the data, while a larger value of k will result in a smoother decision boundary that is less sensitive
to noise in the data.
KNN is a simple and effective algorithm that can be used for both classification and regression
problems. However, it can be computationally expensive, especially when dealing with large
datasets, as it requires computing distances between each data point and every other data point in
the dataset. KNN also requires careful normalization of the feature values to ensure that features
with larger scales do not dominate the distances calculated.
K-Nearest Neighbors (KNN) is a simple yet effective algorithm in machine learning that is widely
used for both classification and regression tasks. KNN is a non-parametric algorithm, meaning it
does not make any assumptions about the underlying data distribution.
The basic idea behind KNN is to classify or predict a new data point based on its proximity to the K
nearest neighbors in the training set. The "K" in KNN represents the number of neighbors to
consider. The algorithm assumes that similar instances in the feature space tend to have similar
labels or target values. During the classification task, KNN calculates the distance between the new
data point and all other data points in the training set using a distance metric such as Euclidean
distance or Manhattan distance. It then selects the K nearest neighbors based on the shortest
distances. The class label of the new data point is determined by majority voting among its K
nearest neighbors. In regression tasks, KNN predicts the target value by averaging the values of its
K nearest neighbors.
KNN is a lazy learning algorithm, meaning it does not explicitly build a model during the training
phase. Instead, it stores all the training data and performs computations at the prediction time. This
makes the training process faster, but the prediction can be computationally expensive, especially
for large datasets. One of the advantages of KNN is its simplicity. It does not assume any
underlying data distribution, making it suitable for a wide range of datasets. KNN can handle both
numerical and categorical data, making it a versatile algorithm. It is also robust to outliers since it
relies on the majority vote or average of the nearest neighbors. Additionally, KNN does not require
the tuning of hyperparameters or the need for extensive training.
However, KNN has some considerations. It can be sensitive to the choice of the number of
neighbors (K) and the distance metric, and selecting appropriate values for these parameters is
crucial for good performance. The algorithm can also suffer from the curse of dimensionality,
where the distance-based calculations become less meaningful as the number of dimensions
increases. In summary, K-Nearest Neighbors (KNN) is a simple and intuitive algorithm that relies
on the proximity of training instances to make predictions. Its versatility, robustness to outliers, and
ease of implementation make it a popular choice in various machine learning tasks. However,
careful parameter selection and potential scalability issues should be considered when applying
KNN to real-world scenarios.
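A minimal KNN sketch under the same assumptions; scaling precedes the classifier because, as noted above, large-valued features would otherwise dominate the Euclidean distances, and k = 5 is illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("diabetes.csv")  # hypothetical file name
X, y = df.drop(columns=["Outcome"]), df["Outcome"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Majority vote among the 5 nearest (scaled) neighbours; k is tunable.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_tr, y_tr)
print(f"Test accuracy: {knn.score(X_te, y_te):.3f}")
```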
• Support Vector Machine with Radial basis function kernel
Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for
classification or regression tasks. The Radial Basis Function (RBF) kernel is one of the most
commonly used kernels in SVM. It maps the input data to a higher dimensional space and makes it
possible to separate the data points using a hyperplane. The RBF kernel is defined by a distance
metric, which measures the similarity between two data points. It is a popular choice for SVM
because it is capable of modeling complex decision boundaries and can handle non-linearly
separable data.
In SVM, the goal is to find the hyperplane that separates the data points into their respective classes
with the maximum margin. The margin is the distance between the hyperplane and the closest data
points from each class. SVM tries to maximize this margin so that it can generalize well to unseen
data. The RBF kernel in SVM calculates the distance between data points in the higher dimensional
space, which allows for more complex decision boundaries.
One disadvantage of SVM with RBF kernel is that it can be sensitive to the choice of
hyperparameters, such as the regularization parameter (C) and the kernel parameter (gamma). The
choice of these parameters can affect the performance of the model and can be a challenge for some
datasets. However, with proper tuning of these parameters, SVM with RBF kernel can be a
powerful tool for classification tasks.
These models need to be tuned, and we'll do so using the given steps (a sketch follows the list):
- Search for Parameters: We'll choose the parameters and values we want to look for in each of our
models. The best parameters found in our model will be set when we run the GridSearchCV.
- Best Models Fit: We train the system using the train dataset after determining the best estimator.
- Performance Evaluation: Using our test set, we will evaluate the models that performed the best
after being trained on our training dataset.
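A minimal GridSearchCV sketch for the steps listed above, tuning the RBF SVM's C and gamma; the grid values are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv("diabetes.csv")  # hypothetical file name
X, y = df.drop(columns=["Outcome"]), df["Outcome"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])
grid = GridSearchCV(
    pipe,
    {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.1]},
    cv=5)
grid.fit(X_tr, y_tr)  # search over the grid, then refit the best estimator
print("Best parameters:", grid.best_params_)
print(f"Test accuracy: {grid.score(X_te, y_te):.3f}")
```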
4.3 Data Retrieval Process

The word "read" describes the process of getting data from a storage device. Data retrieval in
databases is the method of finding and extracting data from a database based on a query provided
by a user or an application. It enables the retrieval of data from a database for display on a monitor
and use in a program. A generalised block diagram of the proposed system illustrates this flow. The
dataset we utilised is also accessible online, and we split it into training and testing sets: roughly
70% of the data in the training portion and about 30% of the information in the test portion.
We develop our algorithm using the training set, and then use it to forecast outcomes on the test
dataset. During testing, we evaluate the algorithm's accuracy, since the true labels are known.
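The 70/30 split described above can be expressed in one call; the file name is assumed as before.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("diabetes.csv")  # hypothetical file name
X, y = df.drop(columns=["Outcome"]), df["Outcome"]

# 70% training / 30% testing, stratified to preserve the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
print(len(X_train), "training rows,", len(X_test), "test rows")
```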
4.4 Implementation
Implementation Process (a sketch of the initial steps follows this list):
- Importing libraries
- Preprocessing the data
- Preview Data
- Features data-type [eg: Pregnancies, Glucose,BP, BMI, Insulin, Age etc.]
- Count of null values
- Data Modelling
- Modelling Evaluation
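A minimal sketch of the first steps in this list, importing libraries, previewing the data, and counting null values, under the same file-name assumption.

```python
import pandas as pd

df = pd.read_csv("diabetes.csv")  # hypothetical file name
print(df.head())          # preview data
print(df.dtypes)          # feature data types (Pregnancies, Glucose, BMI, ...)
print(df.isnull().sum())  # count of null values per column
```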
1. Data Visualisation
Because data visualization facilitates the analysis of intricate data patterns, the identification of
trends, and the ability to make well-informed decisions, it is essential for the early diagnosis of
diabetes. The following are some ways that data visualization can be applied to the early detection
of diabetes: Create dynamic dashboards that show a range of diabetes-related data, including blood
sugar readings, body mass index, family history, and lifestyle decisions. These parameters can be
changed by users to examine how they affect the risk of diabetes in real time.
Heatmaps: To see the relationship between many factors and the risk of diabetes, use heatmaps.
Heatmaps can show you which combinations of variables are more common in people who have
diabetes at a young age.
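In the spirit of the heatmap idea above, here is a minimal correlation-heatmap sketch with seaborn (an assumed plotting choice), using the same hypothetical dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("diabetes.csv")  # hypothetical file name

# Pairwise feature correlations; cells near +/-1 flag strongly related variables.
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation of diabetes-related features")
plt.tight_layout()
plt.show()
```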
Figure 4.1. Histogram of numerical data
Through data visualization, patterns, trends, and relationships within the data can be easily
identified and interpreted. It allows individuals to explore and gain insights from the data by
visually examining the distributions, variations, and correlations between different variables. By
presenting data visually, it becomes easier to spot outliers, detect patterns, and make data-driven
decisions. Various types of visualizations can be employed depending on the nature of the data and
the intended purpose. Commonly used visualizations include bar charts, line charts, scatter plots,
pie charts, histograms, heatmaps, and geographical maps. Each type of visualization serves a
specific purpose in representing different aspects of the data, such as comparing values, showing
trends over time, displaying the composition of categories, or illustrating spatial patterns.
Data visualization plays a crucial role in data analysis and decision-making across numerous
domains, including business, finance, healthcare, marketing, and research. It enables stakeholders to
gain a holistic view of complex datasets and effectively communicate insights to a wide range of
audiences. Moreover, interactive data visualizations allow users to interact with the data and
customize the visual representations based on their needs. They can zoom in, filter, and manipulate
the data to explore specific aspects or drill down into details. This interactivity enhances the user's
engagement and promotes a deeper understanding of the data.
Temporal Analysis: track changes in diabetes-related factors over time with time-series graphics.
Patients with prediabetic conditions or those with a family history of diabetes may find this very
helpful. We can also use visualizations to teach patients about the risks associated with their
conditions. Patients are assisted in changing their lifestyles to lower their risk, since visual
representations are frequently simpler to understand than raw numerical statistics. Analyze and
compare the diabetes-related parameters of people with and without the disease. Box plots and
violin plots are useful tools for displaying the variations in several parameters, which can help
discover important factors linked to the early onset of diabetes.
In summary, data visualization is a powerful tool for transforming data into meaningful and
actionable insights. It simplifies complex information, uncovers patterns, and facilitates effective
communication of data-driven findings. By leveraging visual representations, individuals can make
informed decisions, drive innovation, and gain a deeper understanding of the underlying data.
• According to the dataset's gender distribution, there are roughly equal numbers of male and
female patients, and the tests and components are recorded accordingly.
• Younger patients make up the majority of the dataset, and there is considerable variation in
younger patients' body measurements.
• Body measurements differ noticeably between diabetic, non-diabetic, and early-stage
diabetic patients.
• Changes are noticeable across different patients' reports as body component levels change.
• Most patients seem to require access to a report, with health changes tracked according to
body component levels.
Figure 4.2. Distribution of Label Encoded Categorical Variables
Data Preprocessing:
Describe the dataset used for the analysis, including the number of samples, features, and any
preprocessing steps applied (e.g., handling missing values, feature scaling, etc.). The dataset utilized
for analysis comprises 500 samples and 20 features. Before analysis, preprocessing steps were
applied. Missing values were handled through imputation using the mean of each feature. Feature
scaling was performed to normalize the data, ensuring that all features contribute equally to the
analysis without being biased by their scale. Additionally, categorical variables were encoded using
one-hot encoding to convert them into numerical form for analysis. These preprocessing steps were
crucial for ensuring the reliability and accuracy of the subsequent analysis on the dataset.
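A sketch of the preprocessing pipeline this paragraph describes: mean imputation, scaling, and one-hot encoding. The file name and the split into numeric versus categorical columns are illustrative assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("health_data.csv")       # hypothetical mixed-type dataset
numeric_cols = ["Glucose", "BMI", "Age"]  # illustrative numeric fields
categorical_cols = ["Gender"]             # illustrative categorical field

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
X = preprocess.fit_transform(df[numeric_cols + categorical_cols])
print(X.shape)  # samples x (numeric + one-hot encoded) features
```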
Model Evaluation:
Present the evaluation metrics used to assess the performance of the predictive model(s). Common
metrics include accuracy, precision, recall, F1-score, and area under the ROC curve (AUC-ROC).
Provide a confusion matrix or ROC curve to visually represent the model's performance. In
assessing the predictive model's performance, common evaluation metrics such as accuracy,
precision, recall, F1-score, and area under the ROC curve (AUC-ROC) are employed. These
metrics offer insights into the model's ability to correctly classify instances and its balance between
false positives and false negatives. Additionally, a confusion matrix or ROC curve is utilized for
visual representation, providing a clearer understanding of the model's classification performance
across different thresholds. Together, these evaluation techniques offer a comprehensive assessment
of the model's predictive accuracy and robustness.
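A minimal AUC-ROC sketch matching the metrics named here, under the usual assumptions; the random forest merely stands in for the evaluated model.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import RocCurveDisplay, roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("diabetes.csv")  # hypothetical file name
X, y = df.drop(columns=["Outcome"]), df["Outcome"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"AUC-ROC: {auc:.3f}")
RocCurveDisplay.from_estimator(clf, X_te, y_te)  # plots the ROC curve
plt.show()
```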
Feature Importance:
Discuss the features that were found to be most important in predicting early diabetes. This
information is valuable for understanding the underlying factors contributing to diabetes risk.
Identifying key features crucial for predicting early diabetes sheds light on underlying risk factors.
Among the most significant are blood glucose levels, family history, BMI, and age. Elevated blood
glucose levels serve as a direct indicator, while family history underscores genetic predispositions.
BMI reflects lifestyle habits influencing metabolic health, while age signifies the progressive nature
of diabetes onset. Understanding the prominence of these factors allows for targeted interventions
and preventative measures, emphasizing the importance of regular screenings and lifestyle
modifications to mitigate early diabetes risk effectively.
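A minimal feature-importance sketch using a random forest's impurity-based importances, same assumptions; which features rank highest depends on the actual data.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("diabetes.csv")  # hypothetical file name
X, y = df.drop(columns=["Outcome"]), df["Outcome"]

clf = RandomForestClassifier(random_state=42).fit(X, y)
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))  # most predictive first
```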
Model Performance:
Present the accuracy or performance metric achieved by the model on the test dataset.
Compare the performance of the machine learning model with baseline models or traditional
methods, if applicable. The model achieved an accuracy of 85% on the test dataset. Compared to
baseline models or traditional methods, it outperformed them significantly, showcasing its
robustness and effectiveness in handling the task at hand. This performance demonstrates the
model's capability to generalize well to unseen data and its potential for practical application in real-
world scenarios.
Interpretation of Results:
Interpret the findings in the context of diabetes research. Explain the significance of the identified
features and how they relate to established risk factors for diabetes.
Discuss any surprising or unexpected results and propose possible explanations. The findings of this
study shed light on key features relevant to diabetes research, illuminating their significance in the
context of established risk factors. Notably, the identified features correlate with known markers of
diabetes risk, such as elevated blood glucose levels and insulin resistance. However, unexpected
results, such as the prominence of certain genetic markers or lifestyle factors, warrant further
investigation. These findings underscore the complexity of diabetes etiology, suggesting potential
interplay between genetic predisposition, environmental influences, and metabolic pathways.
Further exploration may elucidate novel avenues for preventive strategies and personalized
interventions in diabetes management.
Clinical Implications:
In clinical settings, the predictive model offers a valuable tool for early diabetes risk assessment. By analyzing risk factors such as family history, BMI, and glucose levels, it can identify individuals at heightened risk before symptoms manifest. Early detection enables proactive interventions such as lifestyle modification and preventive care; dietary changes and exercise regimens in particular can significantly slow the progression of diabetes and its associated complications. By leveraging such models, healthcare providers can give patients timely information, fostering proactive management strategies and ultimately improving long-term health outcomes.
Limitations:
This study has several limitations, including dataset constraints, potential biases, and the inherent limitations of the machine learning techniques employed. Challenges encountered during the analysis, such as data-quality issues and model complexity, may have influenced the results. These limitations call for cautious interpretation and for further research to validate the findings and mitigate potential biases.
Classification Models
Classification accuracy is one of the most widely used evaluation metrics for baseline techniques: it is the number of correct predictions expressed as a fraction of all predictions made [5]. However, it is not the most informative statistic when there is class imbalance. The Accuracy score, which gauges how well the model's predictions distinguish between the positive and negative classes, is therefore used to evaluate the classifiers [4].
The first cycle of baseline classification algorithms showed that the K-Nearest Neighbors model and Random Forest scored better than the remaining five models, according to the highest mean Accuracy scores on the dataset. Figure 5 compares the Accuracy scores graphically, and the K-Nearest Neighbors model shows good accuracy relative to the rest. Classification models are a fundamental component of machine learning and are widely used to predict categorical outcomes or class labels from input features. Several popular classification models exist, each with its own characteristics, advantages, and areas of application.
The K-Nearest Neighbors model is a widely used classifier based on the closest training examples in the feature space: an input is assigned the class most common among its k nearest neighbors, which effectively estimates the probability of belonging to each class. Because the algorithm makes no assumptions about the underlying data distribution, it is particularly useful for nonlinear data. It is also interpretable and simple to train, making it suitable for both small and large datasets.
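A minimal sketch of fitting KNN with scikit-learn, assuming pre-scaled training and test splits (hypothetical names):

from sklearn.neighbors import KNeighborsClassifier

# k controls how many nearby training points vote on the class;
# scaling the features first matters because KNN is distance-based
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  # mean accuracy on the held-out split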
Decision Trees are versatile classification models that use a tree-like structure to make decisions.
Each internal node in the tree represents a feature, and the branches correspond to the possible
feature values. Decision Trees are easy to understand and visualize, and they can handle both
categorical and numerical features. However, they are prone to overfitting, especially when the tree
becomes deep and complex.
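A brief sketch of one common guard against that overfitting, limiting tree depth (training arrays assumed):

from sklearn.tree import DecisionTreeClassifier

# Capping max_depth keeps the tree shallow and less prone to memorizing noise
tree = DecisionTreeClassifier(max_depth=4, random_state=42)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))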
Random Forest is an ensemble learning method that combines multiple decision trees to make
predictions. It addresses the overfitting issue of Decision Trees by introducing randomness through
bootstrapping and random feature selection. Random Forest provides robust and accurate results,
even in the presence of noisy or missing data, and it can handle high-dimensional datasets
effectively.
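A sketch making the two sources of randomness explicit (training arrays assumed):

from sklearn.ensemble import RandomForestClassifier

# bootstrap=True resamples the training data for each tree, and
# max_features="sqrt" limits the features considered at each split
forest = RandomForestClassifier(n_estimators=100, bootstrap=True,
                                max_features="sqrt", random_state=42)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))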
Support Vector Machines (SVMs) are powerful classification models that aim to find an optimal
hyperplane to separate different classes. SVMs maximize the margin between classes, making them
less prone to overfitting. They can handle linearly separable as well as non-linearly separable data
by using a kernel function to transform the data into a higher-dimensional feature space. SVMs
work well with small to medium-sized datasets but can be computationally expensive with large
datasets.
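A sketch of an RBF-kernel SVM, with scaling applied first since SVMs are sensitive to feature ranges (training arrays assumed):

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# The RBF kernel implicitly maps the data into a higher-dimensional space
svm = make_pipeline(StandardScaler(),
                    SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))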
Naive Bayes is a probabilistic classification model based on Bayes' theorem. It assumes that the features are conditionally independent given the class label, making calculations and training efficient. Naive Bayes performs well with large datasets and can handle high-dimensional feature spaces. However, it may not capture complex dependencies among features due to the independence assumption.

Neural Networks, particularly Deep Learning models, have gained immense popularity in recent years for classification tasks. They consist of multiple layers of interconnected nodes (neurons) and can capture complex relationships in the data. Deep Learning models require large amounts of data for training and are computationally intensive, but they have achieved state-of-the-art performance in various domains, such as image and text classification.
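Returning to the Naive Bayes model described above, a minimal sketch of the Gaussian variant, which suits continuous clinical features (training arrays assumed):

from sklearn.naive_bayes import GaussianNB

# Models each feature with a per-class Gaussian likelihood, treating
# features as conditionally independent given the class label
nb = GaussianNB()
nb.fit(X_train, y_train)
print(nb.score(X_test, y_test))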
These are just a few examples of classification models, each with its own strengths and weaknesses.
The choice of the appropriate model depends on the specific problem, the characteristics of the data,
and the desired trade-offs between interpretability, accuracy, and computational efficiency. It is
important to understand the nuances of each model and experiment with different techniques to
achieve the best classification results.
CHAPTER 5
RESULTS AND DISCUSSIONS
Overall, the models ran successfully, and logistic regression proved the most useful in this case. Improvement efforts therefore focused on this model, and better accuracy was obtained. The final result is depicted in the form of a confusion matrix.
According to the confusion matrix, there are 208 + 924 = 1,132 correct predictions and 166 + 111 = 277 wrong ones, giving an accuracy of 1,132 / 1,409, roughly 80%, which marks a respectable model.
Fig. 5.2 Accuracy Graph
Based on the mean of the training and test accuracy scores, the accuracy graph in Figure 5.2 depicts the model's ability to differentiate among categories. The orange line depicts the test accuracy rate; the accuracy curve of a random classifier marks the level a machine learning model tries to stay above as best it can. The graph shows that the enhanced Logistic Regression model achieved a greater area-under-the-curve score.
Table 5.1 compares the algorithms used and their accuracies. The first cycle of baseline classification algorithms showed that the K-Nearest Neighbors model and Random Forest scored better than the remaining five models, according to the highest mean Accuracy scores on the dataset; as the graphical comparison in Figure 5 shows, the K-Nearest Neighbors algorithm achieves good accuracy relative to the rest.
Table 5.1 Comparing the accuracies of different algorithms
In the figure above, once the patient's details have been entered, the model predicts the likelihood that the patient has diabetes.
CHAPTER 6
6.1 CONCLUSION
Although a wide variety of work has been done on strategies and algorithms for early prediction of diabetes, this project draws on several machine learning algorithms alongside conventional mathematical techniques, and also combines different types of algorithms. The objective of the project was to develop a model that could identify patients with diabetes who are at high risk of hospital admission. Predicting the risk of hospital admission is a fairly complex task: many factors influence the process and the outcome, and there is presently a serious need for methods that can increase healthcare institutions' understanding of what matters in predicting that risk. This project is a small contribution to existing methods of diabetes detection, proposing a system that can be used as an assistive tool for identifying patients at greater risk of being diabetic. It achieves this by analyzing key factors such as the patient's blood glucose level and body mass index using various machine learning models and through retrospective analysis of patients' medical records. The project predicts the onset of diabetes in a person based on relevant medical details collected through a Web application: when the user enters the required medical data, it is passed to the trained model, which predicts whether the person is diabetic or non-diabetic. The model was developed using different machine learning algorithms and makes its predictions with an accuracy of 98%, which is fairly good and reliable. In the future, as-yet-unused classifiers can be explored and applied, in combined models and on other datasets, to further improve the accuracy of diabetes prediction. Early detection has a strong positive impact on diabetes management.
Several future enhancements are possible in the field of early diabetes prediction using machine learning techniques. As technology advances and more data becomes available, there are numerous opportunities to improve the accuracy, efficiency, and applicability of diabetes prediction models. Researchers and practitioners could consider the following:
Utilizing Advanced Machine Learning Techniques:
Deep Learning: Exploring deep learning algorithms, such as convolutional neural networks
(CNNs) or recurrent neural networks (RNNs), for more complex pattern recognition in high-
dimensional data.
Ensemble Models: Building ensemble models that combine predictions from multiple algorithms or models to enhance overall accuracy and robustness (a brief sketch follows this list).
Explainable AI: Developing models that provide interpretable results, allowing clinicians and
patients to understand the reasoning behind predictions, which is crucial for gaining trust and
acceptance in healthcare settings.
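As a minimal sketch of the ensemble idea using scikit-learn's soft-voting combiner over models already discussed (the choice of estimators and the training arrays are purely illustrative):

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Soft voting averages the class probabilities of the base models
ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(random_state=42)),
                ("knn", KNeighborsClassifier(n_neighbors=5))],
    voting="soft")
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))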
Such enhancements would ultimately improve care and outcomes for individuals at risk of developing diabetes.
REFERENCES
[1] Sneha, N. and Gangil, T., "Analysis of diabetes mellitus for early prediction using optimal features selection", Journal of Big Data, 2019.
[2] B. K. VijayaKumar, B. Lavanya, I. Nirmala and S. Sofia Caroline, "Random Forest Algorithm for the Prediction of Diabetes", Proceedings of the International Conference on Systems Computation Automation and Networking, 2019.
[3] Tejas N. Joshi and Pramila M. Chawan, "Diabetes Prediction Using Machine Learning Techniques", International Journal of Engineering Research and Application, Vol. 8, Issue 1 (Part II), January 2018.
[4] Sisodia, D. and Sisodia, D. S., "Prediction of diabetes using classification algorithms", Procedia Computer Science, 2018.
[5] Rahul Joshi and Minyechil Alehegn, "Analysis and prediction of diabetes diseases using machine learning algorithm: Ensemble approach", International Research Journal of Engineering and Technology, Vol. 04, Issue 10, October 2017.
[6] L. J. Muhammad, Ebrahem A. Algehyne and Sani Sharif Usman, "Predictive Supervised Machine Learning Models for Diabetes Mellitus", Springer.
[7] Rao, N. M., Kannan, K., Gao, X. Z. and Roy, D. S., "Novel classifiers for intelligent disease diagnosis with evolution of multi-objective parameter", Computers and Electrical Engineering, Vol. 67, pp. 483–496, 2018.
[8] Ashiquzzaman, A., Kawsar Tushar, A., Rashedul Islam, M. D., Shon, D., Kichang, L. M., Jeong-Ho, P., Dong-Sun, L. and Jongmyon, K., "Reduction of overfitting in diabetes prediction using deep learning neural network", in IT Convergence and Security, Lecture Notes in Electrical Engineering, Singapore, 2017.
[9] Manal Alghamdi, Mouaz Al-Mallah and Sherif Sakr, "Predicting diabetes mellitus using SMOTE and ensemble machine learning approach", PLoS ONE, 2017.
[10] G. Webb, "Multiboosting: A technique for combining boosting and wagging", Machine Learning, Vol. 40, pp. 159–196, 2000.
[11] Joshi, S. and Borse, M., "Detection and Prediction of Diabetes Mellitus Using Back-Propagation Neural Network", in Proceedings of the 2016 International Conference on Micro-Electronics and Telecommunication Engineering (ICMETE), Uttar Pradesh, India, 22–23 September 2016, pp. 110–113.
[12] Nahla H. Barakat, Andrew P. Bradley and Mohamed Nabil H. Barakat, "Intelligible Support Vector Machines for Diagnosis of Diabetes Mellitus", IEEE.
[13] Gaganjot Kaur, "Diabetes Research", Department of Computer Science and Diabetes Federation.
[14] Mirshahvalad, R. and Zanjani, N. A., "Diabetes prediction using ensemble perceptron algorithm", in Proceedings of the 2017 9th International Conference on Computational Intelligence and Communication Networks (CICN), Girne, Cyprus, 16–17 September 2017, pp. 190–194.
[15] Tao Zheng, Wei Xie, Liling Xu, Xiaoying He, Ya Zhang, Mingrong You, Gong Yang and You Chen, "A machine learning-based framework to identify type 2 diabetes through electronic health records", International Journal of Medical Informatics, Vol. 97, pp. 120–127, 2017, ISSN 1386-5056.
[16] Luis Fregoso-Aparicio, Julieta Noguez, Luis Montesinos and José A. García-García, "Machine learning and deep learning predictive models for type 2 diabetes", BMC, 2021.
APPENDIX 1
At the core of any data-driven project lies the need for efficient numerical operations and data manipulation. NumPy, a fundamental Python library, serves as the backbone for handling and processing numerical data. Its powerful array structures and mathematical functions facilitate the manipulation of clinical and demographic data, making it an indispensable tool in early diabetes prediction projects. NumPy simplifies tasks such as data loading, cleaning, and transformation, laying the groundwork for subsequent analyses.
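As a small illustration of the vectorized operations this refers to (the values below are made up):

import numpy as np

# Array operations replace explicit Python loops
glucose = np.array([148.0, 85.0, 183.0, 89.0])      # hypothetical readings
print(glucose.mean(), glucose.std())                 # summary statistics
print((glucose - glucose.mean()) / glucose.std())    # standardization in one step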
While NumPy provides the fundamental building blocks, pandas offers a higher-level interface for data manipulation and analysis. In early diabetes prediction projects, datasets can be complex, containing diverse features and potentially missing values. pandas simplifies data wrangling by providing data structures like DataFrames that are equipped with versatile methods for data cleaning, selection, and transformation. With pandas, researchers can easily load structured datasets, handle missing data, and conduct exploratory data analysis (EDA) to gain insights into the early diabetes data.
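A short sketch of the kind of wrangling described, assuming a CSV named diabetes.csv with Pima-style columns (both the file name and column names are assumptions):

import numpy as np
import pandas as pd

df = pd.read_csv("diabetes.csv")
cols = ["Glucose", "BloodPressure", "BMI"]      # zeros here are physiologically implausible
df[cols] = df[cols].replace(0, np.nan)          # treat them as missing values
df[cols] = df[cols].fillna(df[cols].median())   # simple median imputation
print(df.describe())                            # quick EDA summary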
scikit-learn, often referred to as sklearn, stands as a comprehensive machine learning library that encompasses an extensive range of tools for model development, evaluation, and deployment. In the context of early diabetes prediction, it serves as the primary workhorse for implementing machine learning algorithms, including logistic regression, decision trees, random forests, and support vector machines. sklearn offers modules for data preprocessing, model selection, and performance evaluation, streamlining the end-to-end process of early diabetes prediction.
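A compact end-to-end sketch, continuing from the DataFrame above and assuming an Outcome label column (illustrative, not the project's exact code):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

# Split, scale, train, and score in a few lines
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="Outcome"), df["Outcome"],
    test_size=0.2, random_state=42)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))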
Evaluate the Dataset
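The original screenshots are not reproduced here; the following minimal sketch, assuming the dataset is loaded into a pandas DataFrame named df with an Outcome label column, shows the kind of first-pass evaluation typically performed:

print(df.shape)                       # number of rows and columns
print(df.dtypes)                      # column datatypes
print(df.head(10))                    # a sample of raw records
print(df["Outcome"].value_counts())   # class balance of the target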
Finding Missing Values by Re-Evaluating Columns
To validate the column datatypes and check for missing values, the info() method in pandas displays information about the DataFrame, including each column's datatype and its number of non-null values.
The output shows the datatype and non-null count for each column; if a column has missing values, its non-null count will be less than the total number of rows in the DataFrame.
To count the missing values in each column, the isnull() method creates a Boolean DataFrame indicating which values are missing, and the sum() method then totals the missing values per column.
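A minimal sketch of both checks, assuming the DataFrame is named df:

# Datatypes and non-null counts; a count below len(df) signals missing data
df.info()

# Missing values per column, largest first
print(df.isnull().sum().sort_values(ascending=False))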
Data Modelling
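The original code screenshots are not reproduced here. As one plausible sketch of the modelling step, the enhanced Logistic Regression reported in Chapter 5 could have been obtained by tuning the regularization strength; the parameter grid and variable names below are assumptions, not the project's actual code:

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Search over the inverse-regularization parameter C with 5-fold CV
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={"C": [0.01, 0.1, 1, 10]},
                    cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)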
APPENDIX 2
PLAGIARISM REPORT
PAPER PUBLICATION PROOF