DIABETES PREDICTION USING MACHINE LEARNING ALGORITHMS
A PROJECT REPORT
Submitted by
Aryan Kumar [Reg. No. RA2011003010535]
Adarsh [Reg. No. RA2011003010522]
BONAFIDE CERTIFICATE
Certified that the 18CSP109L B.Tech project report titled “DIABETES PREDICTION USING
MACHINE LEARNING ALGORITHM” is the bonafide work of Mr. Aryan Kumar
[Reg. No. RA2011003010535] and Mr. Adarsh [Reg. No. RA2011003010522], who carried out the
project work under my supervision. Certified further that, to the best of my knowledge, the work
reported herein does not form part of any other thesis or dissertation on the basis of which a degree
or award was conferred on an earlier occasion for this or any other candidate.
DR. M. PUSHPALATHA
HEAD OF THE DEPARTMENT
Department of Computing Technologies
DECLARATION

I hereby certify that this assessment complies with the University’s Rules and Regulations relating
to academic misconduct and plagiarism, as listed on the University website, in the Regulations, and
in the Education Committee guidelines.
I confirm that all the work contained in this assessment is our own except where indicated, and that
we have met the following conditions:
• Clearly referenced / listed all sources as appropriate
• Referenced and put in inverted commas all quoted text (from books, the web, etc.)
• Not made any use of the report(s) or essay(s) of any other student(s), either past or present
• Acknowledged in appropriate places any help that we have received from others (e.g.
fellow students, technicians, statisticians, external sources)
• Complied with any other plagiarism criteria specified in the course handbook /
University website
I understand that any false claim for this work will be penalized in accordance with the University
policies and regulations.
ACKNOWLEDGEMENT
We extend our sincere thanks to Dean - CET, SRM Institute of Science and Technology, Dr. T. V.
Gopal, for his invaluable support.
We wish to thank Dr. Revathi Venkataraman, Professor and Chairperson, School of Computing,
SRM Institute of Science and Technology, for her support throughout the project work.
Our inexpressible respect and thanks to our guide, Dr S. Gnanavel, Assistant Professor,
Department of Computing Technologies, SRM Institute of Science and Technology, for providing us
with an opportunity to pursue our project under his mentorship. He provided us with the freedom
and support to explore the research topics of our interest. His passion for solving problems and
making a difference in the world has always been inspiring.
We want to convey our thanks to our panel head, Dr. S. Gnanavel, Associate Professor, Department
of Computing Technologies, and panel members M. Revathi, Assistant Professor, R. Thilagavathy,
Assistant Professor, and Dr. M. Suganiya, Assistant Professor, Department of Computing
Technologies, SRM Institute of Science and Technology, for their input during the project reviews
and support.
project reviews and support.
We register our immeasurable thanks to our Faculty Advisor, Mrs Brindha R, Assistant Professor,
Department of Computing Technologies, SRM Institute of Science and Technology for leading and
helping us to complete our course.
We sincerely thank all the staff and students of the Computing Technologies Department, School of
Computing, SRM Institute of Science and Technology, for their help during our project. Finally,
we would like to thank our parents, family members, and friends for their unconditional love,
constant support, and encouragement.
ABSTRACT
Diabetes mellitus is a significant non-communicable chronic metabolic illness that poses a serious
health risk to humans. Early diabetes detection is essential for prompt intervention and efficient
care, which lowers the risk of complications. Estimating diabetes at an early stage is challenging
because the majority of existing approaches rely on a single prediction model. Focusing on the
predictive results of machine learning models, this study proposes a combination of estimating
models for early diabetes prediction and conducts practical research on the model's efficacy.
According to the findings, the combined prediction model outperforms the single early diabetes
prediction model in terms of accuracy and predictive impact.
TABLE OF CONTENTS
ABSTRACT VI
LIST OF FIGURES IX
LIST OF TABLES X
ABBREVIATIONS XI
1 INTRODUCTION 1
1.1 Overview 1
1.2 Approach 2
1.3 General Steps Involved 3
2 LITERATURE SURVEY 6
2.1 Literature Review 6
3 SYSTEM ARCHITECTURE AND DESIGN 12
4 METHODOLOGY 15
6 CONCLUSION AND FUTURE ENHANCEMENT 30
6.1 Conclusion 38
6.2 Future Enhancement 39
REFERENCES 42
APPENDIX 1 44
APPENDIX 2 61
PLAGIARISM REPORT 63
PAPER PUBLICATION 64
LIST OF FIGURES
4.1 Histogram of numerical data
4.2 Distribution of label-encoded categorical variables

LIST OF TABLES

LIST OF SYMBOLS AND ABBREVIATIONS
AI Artificial Intelligence
ML Machine Learning
DL Deep Learning
CHAPTER 1
INTRODUCTION
1.1 Overview
The term "diabetes mellitus" in this project refers to a group of metabolic diseases characterized by
high blood sugar levels, caused either by insufficient insulin synthesis or by inadequate insulin
responsiveness of the body's cells. Insulin is the hormone that controls blood glucose levels. The
result of this ongoing disease is blood that circulates with too much sugar. Diabetes is one of the
non-communicable diseases that endanger people's health. It is a chronic disease in which the body
either produces insufficient insulin or is unable to use the insulin that is produced. Diabetes should
not be disregarded because, if ignored, it can result in a number of major health problems, such as
heart conditions, renal disease, high blood pressure, eye damage, and organ failure.
Machine learning presents a powerful tool for predicting diabetes and reducing its impact. Various
machine learning algorithms, including KNN, random forests, decision trees, and logistic
regression, can be trained on historical data to identify patient trends and predict the onset of the
disease. The dataset includes an individual's dietary preferences, medical history, and patterns of
physical activity. The most pertinent variables are found using feature selection approaches, which
improves the prediction model's precision and effectiveness. If diabetes is discovered earlier, it can
be treated. Different machine learning techniques are applied and evaluated for their efficacy in
predicting diabetes. To accomplish this goal, we will use a variety of methodologies to more
accurately forecast the onset of diabetes in patients. Here, we will investigate a group of models,
including the AdaBoost classifier, the Naive Bayes classifier, and the Random Forest classifier.
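To make the idea concrete, here is a minimal, hedged sketch of training these three classifiers with scikit-learn. The file name diabetes.csv is hypothetical; the schema is assumed to follow the features listed in Section 1.3, with a binary Outcome label.

```python
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hypothetical file; assumed Pima-style schema with a binary "Outcome" column.
df = pd.read_csv("diabetes.csv")
X, y = df.drop(columns=["Outcome"]), df["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

models = {
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)  # train on historical patient data
    print(name, round(accuracy_score(y_test, model.predict(X_test)), 3))
```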
1.2 Approach
Predicting diabetes at an early stage is crucial for timely intervention and management. Several
approaches can be employed for early diabetes prediction.
• Patient History: Gather comprehensive patient history, including family medical history,
lifestyle factors (diet, exercise), and any previous instances of elevated blood sugar levels.
• Feature Selection: Identify relevant features using techniques like correlation analysis to
select the most important predictors (a short sketch follows this list).
• Machine Learning Algorithms: Implement machine learning models like Logistic
Regression, Decision Trees, Random Forest, or even advanced techniques like Neural
Networks for prediction.
• Biometric Data: Collect data like BMI (Body Mass Index), waist circumference, and blood
pressure.
• Biological Markers: Monitor glucose levels, insulin resistance, and other relevant
biomarkers.
• Community Health Programs: Implement community-based programs to raise awareness
about diabetes prevention, encourage regular check-ups, and promote a healthy lifestyle.
• Continuous Monitoring: For individuals at high risk, establish a system for continuous
monitoring and follow-up care.
• Data Privacy: Ensure that patient data is anonymized and privacy regulations are strictly
adhered to.
• Informed Consent: Obtain informed consent from individuals participating in research
studies or data collection efforts.
• Collaboration with Healthcare Providers: Collaborate with healthcare providers to offer
screenings and early detection camps in communities.
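As referenced in the Feature Selection bullet above, the following minimal sketch keeps features whose absolute correlation with the label exceeds an illustrative threshold; the file name and the 0.2 cut-off are assumptions.

```python
import pandas as pd

df = pd.read_csv("diabetes.csv")  # hypothetical file name

# Absolute Pearson correlation of each predictor with the binary Outcome label.
corr = df.corr(numeric_only=True)["Outcome"].drop("Outcome").abs()

# Keep predictors above an illustrative cut-off of 0.2.
selected = corr[corr > 0.2].index.tolist()
print("Selected predictors:", selected)
```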
1.3 General Steps Involved in Diabetes Prediction
1. Data Collection Process: The initial step involves the meticulous gathering of patients' data,
encompassing various facets such as demographics, health history, and patient interactions. For
our illustration, we have employed a dataset from Kaggle containing patient body measurements.
This data collection process is systematic, revolving around defining a research question or
hypothesis, selecting a suitable sample population, and determining the appropriate data collection
methods and tools.
2. Data Preprocessing Steps: Following data collection, the data is subjected to thorough
preprocessing. This entails addressing missing values, handling outliers, and rectifying
inconsistencies. The objective is to transform and normalize the data, making it conducive for
utilization in machine learning algorithms. Data preprocessing is a comprehensive procedure
involving several facets:
• Data Integration: Merging information from diverse datasets into a cohesive whole.
3. Feature Engineering Endeavors: This phase focuses on extracting pertinent features from the
data that are likely to indicate diabetes risk. These features may encompass variables such as the
patient's Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes Pedigree
Function, Age, and Outcome. Feature engineering is the process of shaping raw data into valuable
features for machine learning models. This process involves several key aspects:
• Feature Selection: Identifying the most pertinent features from a broader set, often through
statistical analysis or assessing attribute significance.
• Feature Extraction: Generating new features from existing ones, employing techniques like
principal component analysis or domain-specific knowledge.
• Feature Scaling: Ensuring that feature values are standardized to a comparable range, vital
for certain machine learning models (a sketch follows this list).
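A minimal scaling sketch for the Feature Scaling step above, assuming the same hypothetical diabetes.csv; StandardScaler is one reasonable choice among several (e.g. MinMaxScaler).

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("diabetes.csv")  # hypothetical file name
X = df.drop(columns=["Outcome"])

# Standardize each feature to zero mean and unit variance so large-scale
# columns (e.g. Insulin) do not dominate distance-based models.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0).round(2), X_scaled.std(axis=0).round(2))
```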
4. Model Selection Considerations: The critical decision of choosing a suitable machine learning
model for the specific problem at hand is pivotal. Common models used for diabetes prediction
include logistic regression, decision trees, random forests, and support vector machines. Model
selection is a crucial step, as it entails picking the optimal model from a range of candidates, all
trained on the same dataset. The aim is to identify the model capable of generalizing effectively to
new data, yielding accurate predictions. Model selection techniques encompass various methods
such as cross-validation, holdout validation, and bootstrapping. The choice of model significantly
influences predictive performance, emphasizing the importance of careful selection.
5. Model Training: The selected model is trained using the preprocessed data. Model training plays
a pivotal role in machine learning, involving the process of teaching the model to make accurate
predictions. It entails iteratively adjusting the model's parameters, enabling it to recognize patterns
in input data and generate desired outputs. The training process employs a training dataset
containing input features and corresponding target labels. The model optimizes its parameters based
on this data, facilitating accurate predictions on unseen data. Training success relies on factors
including data quality, algorithm choice, optimization techniques, and hyperparameters. Model
performance is assessed using validation methods such as cross-validation, ensuring reliable
predictions; a sketch follows.
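A minimal sketch of the cross-validation mentioned above, under the same dataset assumption; the 5-fold setting and the choice of a random forest are illustrative.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("diabetes.csv")  # hypothetical file name
X, y = df.drop(columns=["Outcome"]), df["Outcome"]

# 5-fold cross-validation: each fold serves once as the validation set.
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5)
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```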
6. Model Evaluation Procedures: The performance of the trained model is assessed using suitable
evaluation metrics such as accuracy, precision, recall, and F1 score. Model evaluation is a pivotal
facet of machine learning, involving the scrutiny of a trained model's performance on new, unseen
data. The objective is to verify the model's ability to generalize effectively and produce accurate
predictions. Evaluation metrics like accuracy, precision, recall, and F1 score are applied, depending
on the application's nature. Model evaluation is an iterative process, often necessitating adjustments
to hyperparameters and data preprocessing to enhance performance. Additionally, it aids in model
comparison and the selection of the most suitable model for a given application.
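The sketch below computes the metrics this step names with scikit-learn, under the same hypothetical dataset; logistic regression merely stands in for whichever model was selected.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

df = pd.read_csv("diabetes.csv")  # hypothetical file name
X, y = df.drop(columns=["Outcome"]), df["Outcome"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = model.predict(X_te)
print(confusion_matrix(y_te, y_pred))       # rows: actual, columns: predicted
print(classification_report(y_te, y_pred))  # precision, recall, F1, accuracy
```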
CHAPTER 2
LITERATURE SURVEY
Currently, both traditional statistics-based forecasting and predictions utilizing integrated classifiers
are employed in algorithms for predicting patient outcomes among domestic and global users. These
methods combine machine learning techniques with statistical theory and utilize consumer visual
insights to establish relationships between various indicators. For instance, MGUIIS and Co.
developed a predictive model based on logistic regression, focusing on the average time patients
spend per day. Experimental results using a real dataset, after identifying and replacing null values,
show that the proposed technique has higher accuracy after imputation of missing values. They
conducted a comparison study to validate the effectiveness of their new technique in predicting
patient behaviour before and after optimization. The authors of this study conducted patient health
analysis using a logistic regression model, training and evaluating it with factors such as
Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes Pedigree Function,
Age, and Outcome. In the initial testing phase, the model displayed a 74% accuracy rate, which
later increased to 79%. Furthermore, combining the two distinct datasets mentioned above
significantly enhanced the model's accuracy. However, it is worth noting that the diabetes prediction
model overlooked some critical factors influencing subscriber decision-making processes, such as
recent package utilization and satisfaction with patient support. Thus, it may not serve as a
comprehensive tool for identifying the causes of patient turnover. Nonetheless, this research carries
significant value. The Improved Diabetes Prediction Method comprises three key steps: quantifying
tie strength, utilizing machine learning techniques to amalgamate traditional and social variables,
and employing an influence propagation model. For strategic planners, a pattern analysis framework
is recommended to offer guidance. The chat-graph approach to diabetes prediction focuses on
forecasting based on conversation activity. However, this approach does not consider the social
elements derived from graph theory. Users are grouped into categories for diabetes prediction based
on their online actions using a clustering method, which then applies rules to prevent attrition. In
contrast, diabetes prediction by exploratory data mining relies on clustering, a common technique
for statistical data analysis used in many fields, including machine learning, pattern recognition,
image analysis, information retrieval, bioinformatics, data compression, and computer graphics.
1. Sneha, N. and Gangil, T., "Analysis of diabetes mellitus for early prediction using optimal
features selection," Journal of Big Data, 6(1), p. 13 (2019). [1]
In their paper "Analysis of diabetes mellitus for early prediction using optimal feature selection,"
Sneha and Gangil delve into the crucial realm of diabetes detection by employing advanced
predictive analysis techniques. Their study focuses on the selection of pertinent attributes crucial for
early identification of diabetes mellitus, leveraging the potential of machine learning algorithms.
Drawing upon data sourced from the UCI machine learning repository, the authors meticulously analyze 15
attributes for classification purposes. Their investigation harnesses the prowess of several prominent
classifiers including Support Vector Machine, Random Forest, and Naïve Bayes. Through rigorous
experimentation, they achieve notable accuracies of 77.73%, 75.39%, and 73.48% respectively for
each classifier. By employing a diverse range of machine learning methodologies, Sneha and
Gangil contribute to the burgeoning field of predictive healthcare analytics, offering insights into
the potential of algorithmic approaches in early diabetes detection. Their findings not only
underscore the significance of optimal feature selection in predictive modeling but also highlight
the efficacy of machine learning tools in healthcare applications. This research serves as a pivotal
step towards developing more efficient and accurate diagnostic tools for diabetes mellitus,
ultimately aiming to enhance early intervention strategies and improve patient outcomes.
Additionally, the study underscores the growing relevance of interdisciplinary collaborations
between healthcare professionals and data scientists in addressing complex medical challenges
through innovative computational approaches.
2. B. K. VijayaKumar, B. Lavanya, I. Nirmala and S. Sofia Caroline, "Random Forest Algorithm
for the Prediction of Diabetes". [2]

In their paper titled "Random Forest Algorithm for the Prediction of Diabetes," B. K. VijayaKumar,
B. Lavanya, I. Nirmala, and S. Sofia Caroline introduce a novel approach to diabetes prediction
aimed at enhancing the accuracy and efficiency of early detection systems. Their proposed method
leverages the random forest algorithm, a powerful ensemble learning technique known for its
effectiveness in classification tasks. The primary objective of their research is to develop a
predictive model capable of identifying individuals at risk of diabetes with high precision. Through
extensive experimentation and analysis, the authors demonstrate the superiority of their approach in
comparison to existing methods. The results obtained from their study underscore the effectiveness
of the proposed model in predicting diabetes onset, showcasing its potential to significantly improve
healthcare outcomes. By harnessing the capabilities of the random forest algorithm, the authors not
only enhance prediction accuracy but also streamline the process, enabling instantaneous
assessment of diabetes risk for patients. This innovation holds significant promise for proactive
healthcare interventions, allowing healthcare providers to intervene early and implement preventive
measures effectively. Furthermore, the study contributes to the existing body of research in diabetes
prediction by offering a robust and efficient solution that outperforms traditional approaches. The
authors' rigorous evaluation of their model against established benchmarks highlights its superiority
and underscores its potential for widespread adoption in clinical settings. In a similar vein, Nonso
Nnamoko et al. present their findings on diabetes prediction, employing a group-supervised learning
approach. Utilizing five widely recognized classifiers for groups along with a Meta classifier, their
research also aims to enhance prediction accuracy. By comparing their results with existing studies
utilizing similar datasets, they showcase the efficacy of their method in accurately predicting the
onset of diabetes. The comparative analysis conducted by Nnamoko et al. further validates the
importance of leveraging advanced machine learning techniques for diabetes prediction.
Collectively, these studies contribute to the advancement of predictive healthcare analytics, offering
valuable insights and methodologies for early disease detection and intervention. The integration of
sophisticated algorithms such as random forest and group-supervised learning demonstrates the
potential of machine learning in revolutionizing healthcare delivery. Moving forward, continued
research and innovation in this domain hold the promise of further improving predictive models,
ultimately leading to more effective disease management and improved patient outcomes.
3. Tejas N. Joshi and Pramila M. Chawan, "Diabetes Prediction Using Machine Learning
Techniques," International Journal of Engineering Research and Application, Vol. 8, Issue 1
(Part II), January 2018, pp. 09-13. [3]
In their paper titled "Diabetes Prediction Using Machine Learning Techniques," Tejas N. Joshi
and Prof. Pramila M. Chawan delve into the realm of diabetes prediction employing advanced
machine learning methods. Their study, published in the International Journal of Engineering
Research and Application in January 2018, focuses on the development of an effective technique
for the early detection of diabetes. To achieve this, the researchers explore three distinct supervised
machine learning approaches: Support Vector Machines (SVM), logistic regression, and Artificial
Neural Networks (ANN). By leveraging these methodologies, they aim to enhance the accuracy and
efficiency of diabetes prediction. Moreover, Deeraj Shetty and his colleagues contribute to this
domain with their work on "Intelligent Diabetes Disease Prediction System," which utilizes data
mining techniques. Their system incorporates algorithms such as Bayesian and K-Nearest Neighbor
(KNN) to analyze diabetes patient data and predict the onset of the disease. By amalgamating
various attributes of diabetes patients' diagnoses, Shetty et al. strive to provide a comprehensive
analysis that aids in early disease detection and management. Through the integration of machine
learning and data mining techniques, both studies underscore the importance of leveraging
advanced computational methods in healthcare for proactive disease management and prevention.
This research signifies a significant stride towards harnessing the power of artificial intelligence and
data analytics to address pressing medical challenges, ultimately contributing to improved patient
outcomes and healthcare delivery.
4. Sisodia, D. and Sisodia, D. S., "Prediction of diabetes using classification algorithms,"
Procedia Computer Science, 132, pp. 1578-1585 (2018). [4]
In their 2018 study, "Prediction of diabetes using classification algorithms," D. Sisodia and D.
Sisodia delve into the development of a support system aimed at predicting diseases, particularly
diabetes, utilizing the Pima Indians Diabetes Database (PIDD). Through the utilization of
three distinct machine learning recognition algorithms, namely Bayes Naive, Support Vector
Machine (SVM), and Decision Tree, the authors sought to diagnose diabetes at its earlier stages,
achieving notable accuracies of 76.3%, 65.1%, and 73.82%, respectively. Their approach, grounded
in a case study, was compared against other established methods such as decision trees and neural
networks, revealing its superior performance in terms of both classification accuracy and feature
selection. The study not only underscores the significance of their proposed method but also its
practical implications across various domains, including patient retention, marketing strategies, and
patient relationship management within the telecommunications industry. By emphasizing the
efficacy of stratified sampling and model combination techniques, the authors advocate for a more
nuanced approach to enhancing the accuracy of diabetes prediction models. Consequently, this
research represents a significant contribution to the field of patient diabetes prediction, shedding
light on the pivotal role played by advanced machine learning algorithms and highlighting avenues
for further refinement and application in real-world scenarios. Through their rigorous methodology
and comprehensive analysis, D. Sisodia and D. Sisodia underscore the potential for data-driven
approaches to revolutionize healthcare delivery, paving the way for more personalized and effective
interventions aimed at combating chronic diseases such as diabetes.
5. Rahul Joshi and Minyechil Alehegn, "Analysis and prediction of diabetes diseases using
machine learning algorithm: Ensemble approach," International Research Journal of
Engineering and Technology, Volume 04, Issue 10, Oct 2017. [5]
In their paper titled "Analysis and prediction of diabetes diseases using machine learning algorithm:
Ensemble approach," Rahul Joshi and Minyechil Alehegn delve into the realm of predictive
analytics in healthcare, particularly focusing on the early diagnosis of diabetes. Employing machine
learning (ML) techniques, the authors aim to utilize data-driven approaches to forecast diabetic
conditions in patients, thereby potentially saving lives through timely intervention. The study
leverages renowned ML algorithms such as K-Nearest Neighbors (KNN) and Naïve Bayes to make
educated guesses on the dataset in its initial phases. The results are promising, showcasing a high
accuracy rate of 90.36% in their proposed method, outperforming Decision Stump which trailed
behind at 83.72%. Notably, the ensemble approach, integrating multiple algorithms like Random
Forest, Naïve Bayes, KNN, and J48, proves to be superior in accuracy compared to individual
algorithms. The authors highlight the efficacy of the decision tree algorithm, particularly in its
ability to deliver highly accurate results across various tests. To facilitate their research, the authors
utilize Java and Weka as tools in this hybrid study, harnessing their capabilities for predicting
diabetes data. Central to their approach is the utilization of an ensemble hybrid model, wherein the
combined strength of KNN, Naive Bayes, Random Forest, and J48 algorithms is harnessed to
enhance performance and accuracy. Among these, J48 stands out as a popular choice, exhibiting
commendable accuracy rates. Interestingly, the study reveals that Random Forest surpasses J48 and
Naive Bayes in accuracy when subjected to 10-fold cross-validation splitting, further
solidifying its efficacy in diabetic prediction models. Moreover, to mitigate the risk of erroneous
treatments, the authors develop a fuzzy rule, adding a layer of sophistication to their predictive
model. Overall, the research underscores the significance of ML-driven approaches in healthcare,
particularly in the domain of diabetic prediction, showcasing how ensemble methods can enhance
accuracy and potentially revolutionize early diagnosis and treatment strategies, ultimately
improving patient outcomes and quality of life.
6. L. J. Muhammad, Ebrahem A. Algehyne and Sani Sharif Usman, "Predictive Supervised
Machine Learning Models for Diabetes Mellitus," published by Springer. [6]
In this study, the diagnostic dataset of type 2 DM was collected from the Murtala Mohammed
Specialist Hospital, Kano, and used to develop predictive supervised machine learning models
based on logistic regression, support vector machine, K-nearest neighbor, random forest, naive
Bayes, and gradient boosting algorithms.
The random forest predictive learning-based model appeared to be one of the best developed
models with 88.76% accuracy; however, in terms of the receiver operating characteristic curve,
the random forest and gradient boosting predictive learning-based models were found to be the
best predictive learning models, each with 86.28% predictive ability.
CHAPTER 3
SYSTEM ARCHITECTURE AND DESIGN
In order to create, train, and implement a diabetes prediction model, the system architecture for
early diabetes prediction using machine learning usually includes a number of components. Data
collection, the first component, entails gathering and preparing patient data from a variety of
sources, including health records, patient reviews, and demographic data. The data is then cleansed,
transformed, and made ready for modelling. The second component is feature engineering, which
comprises selecting and adjusting relevant data characteristics to improve the accuracy of the
prediction model. Methods like dimensionality reduction, feature scaling, and feature selection may
be applied for this.
3.2 Use Case Diagram
To efficiently detect and anticipate diabetes among patients, the diabetes prediction system
architecture consists of several components and phases.
The data collection phase, which forms the basis of the architecture, involves gathering pertinent
data from a variety of sources, including health history, patient feedback, demographic data, and
patient interactions. The data is then kept in a central data repository, such as a data warehouse or
big data platform, for later processing and analysis. Data preparation is the following step, where
the gathered data is cleaned, transformed, and feature engineered. This stage guarantees that the
format of the data is appropriate for modelling and analysis. Missing value handling, data
normalisation, and feature creation based on domain expertise are a few examples of activities that
could be included.
The model creation step of the design comes after data preparation. At this point, statistical or
machine learning methods are used to develop prediction models. In order to find trends and
pinpoint the main indicators of diabetes, these models are trained on historical patient data, which
includes both non-diabetic and diabetic patients. Various techniques, based on the available data
and the complexity of the task, can be used, including logistic regression, decision trees, and
neural networks.
Lastly, a feedback loop is incorporated into the system design to help the diabetes prediction model
improve over time. The model may be frequently retrained and modified to adjust to shifting patient
behaviour by gathering input on the forecast accuracy and tracking the actual diabetes outcomes.
The fourth component is training the model, which comprises using techniques such as cross-
validation and hyperparameter tuning to maximize the model's performance by training the
selected algorithm on the prepared data. As the final component, the trained model's performance is
evaluated on a holdout set of data, ensuring that it generalises well to new data. Model deployment,
the last phase, involves applying the learned model to new data and using it to generate predictions.
To do this, the model may be released as a REST API or microservice that can be integrated into
existing applications. Overall, there are several elements in the system architecture for diabetes
prediction using machine learning that call for proficiency in machine learning algorithms, feature
engineering, data preprocessing, and deployment infrastructure. For the system to manage massive
data volumes and keep producing precise predictions over time, it must also be scalable,
dependable, and maintainable.
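As a sketch of the deployment phase described above, a trained model could be exposed as a small REST endpoint. Flask and the joblib file name are illustrative assumptions, not the report's actual deployment stack; any microservice framework would serve equally.

```python
import joblib
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("diabetes_model.joblib")  # hypothetical serialized model

FEATURES = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
            "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()  # expects one JSON value per feature
    row = pd.DataFrame([[payload[f] for f in FEATURES]], columns=FEATURES)
    return jsonify({"diabetes": int(model.predict(row)[0])})

if __name__ == "__main__":
    app.run()
```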
CHAPTER 4
METHODOLOGY
4.1 Existing System

A rudimentary machine learning model with a predefined set of characteristics, trained on sparse
and antiquated data, may already be in use at certain organisations. Lower accuracy may result
from this approach's failure to take dynamic shifts in patient behaviour and preferences into
account.
It is possible that others have put in place more sophisticated machine learning algorithms that
make use of a wide range of patient data, such as demographics, past health reports, activity,
blood samples, and sentiment analysis on social media. These models might generate precise and
dynamic diabetes predictions by utilising methods like neural networks and decision trees.
Predictive modelling, data analysis, and data gathering are usually combined in current diabetes
prediction systems. Companies collect pertinent data from a variety of sources, including past
health records, demographic data, and contacts with patients. Databases and data warehouses are
used to organise and store this data.
Data preparation is the process of transforming and cleaning the obtained data to make sure it is
suitable for analysis and of high quality. In order to produce a consistent and trustworthy dataset,
this stage entails addressing missing values, eliminating outliers, and normalising data. Businesses
use predictive modelling approaches to create diabetes prediction models once the data is prepared.
These models find trends, correlations, and variables that indicate diabetes risk using machine
learning algorithms or statistical techniques. To train the models using past patient data, several
algorithms including logistic regression, decision trees, random forests, or neural networks are
frequently employed on Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI,
Diabetes Pedigree Function, Age, and Outcome.
AUC, accuracy, precision, recall, and other measures are used to assess the diabetes prediction
models' performance. This assessment aids in determining how well the model predicts diabetes
onset. The model is incorporated into clinical systems or patient relationship management (CRM)
platforms for real-time diabetes prediction if it satisfies the required performance benchmarks.
To evaluate the effectiveness of the model in the current system, practitioners frequently track the
diabetes forecasts and contrast them with the actual diabetes results. Through retraining and
updating the diabetes prediction models based on the most recent data and patient behaviour, this
feedback loop enables continuous improvement of the models. In order to offer insights into patient
categories that are at high risk of diabetes, the efficacy of intervention strategies, and diabetes-
related patterns, the current system may additionally include capabilities like dashboards or
visualisation tools. These visualisations aid in understanding the dynamics of diabetes across
factors such as Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes
Pedigree Function, and Age, and inform decision-making.
4.2 Proposed System

A predictive diabetes model is a classification tool: a system that examines the traits of individuals
to determine which traits are essential in forecasting outcomes. Let's imagine we have a dataset with
information on Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes
Pedigree Function, Age, and Outcome of people [1]. These people's characteristics, including their
Glucose, Blood Pressure, Insulin, BMI, Age, and Outcome, among others, are described in the data.
Our model should predict each patient's outcome; hence, Outcome will be the target variable. The
data should be examined with an emphasis on how various aspects connect to the patient's diabetes
status [14].
We are prepared to construct many models in search of the optimum fit. Diabetes prediction is a
binary classification problem, since a patient either has diabetes or does not.
We'll test:
• Naive Bayes
Naive Bayes is a machine learning algorithm based on the Bayes theorem of probability. It is a
probabilistic algorithm that uses the conditional probability of features to classify data into different
categories. Naive Bayes is commonly used for text classification and spam filtering, but it can also
be used in other classification tasks such as sentiment analysis, recommendation systems, and
diabetes prediction. The algorithm works by calculating the probability of each feature
given a class label and then multiplying all these probabilities to get the probability of a data point
belonging to a particular class. The class with the highest probability is then assigned as the
prediction for the data point.
Naive Bayes is a probabilistic machine learning algorithm commonly used for classification tasks. It
is based on Bayes' theorem and assumes that the features are conditionally independent given the
class label. Despite its simplicity and naive assumption, Naive Bayes often performs remarkably
well and is widely used in various applications such as spam filtering, sentiment analysis, and
document categorization.
The algorithm is called "naive" because it assumes that the presence or absence of a particular
feature is independent of the presence or absence of any other feature, given the class label. This
assumption allows for simplified calculations and efficient training.
During the training phase, Naive Bayes calculates the probabilities of each feature given each class
label by counting occurrences in the training data. It estimates the prior probabilities of each class
label based on the frequency of their occurrences. These probabilities are then combined using
Bayes' theorem to calculate the posterior probability of each class label given the observed features.
During the prediction phase, Naive Bayes uses the calculated probabilities to determine the most
likely class label for a new instance. It calculates the posterior probabilities for each class label and
selects the label with the highest probability as the predicted class.
Naive Bayes has several advantages. It is computationally efficient and works well with large
datasets. It can handle high-dimensional feature spaces and is robust to irrelevant features, as the
independence assumption allows it to disregard irrelevant correlations. Naive Bayes is also less
prone to overfitting, especially when the training data is limited. Despite its simplicity, Naive Bayes
performs well in many real-world scenarios. However, the assumption of feature independence can
limit its effectiveness in cases where there are strong dependencies among the features. In such
cases, more sophisticated algorithms may be more appropriate. Additionally, Naive Bayes is
sensitive to the presence of rare or unseen feature combinations in the training data, which can
result in zero probabilities and affect the accuracy of predictions.
In summary, Naive Bayes is a simple yet effective probabilistic algorithm used for classification
tasks. Its efficiency, ability to handle high-dimensional data, and robustness to irrelevant features
make it a popular choice in various applications. However, its assumption of feature independence
may limit its performance in certain scenarios.
One of the strengths of Naive Bayes is that it requires a relatively small amount of training data to
estimate the parameters needed for classification. However, it can be sensitive to irrelevant or
correlated features, and its assumption of independence may not hold in some real-world
applications.
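A minimal Gaussian Naive Bayes sketch for this task, under the same hypothetical diabetes.csv; GaussianNB is assumed because the clinical features here are continuous.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

df = pd.read_csv("diabetes.csv")  # hypothetical file name
X, y = df.drop(columns=["Outcome"]), df["Outcome"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Fits class priors and one Gaussian per feature per class.
nb = GaussianNB().fit(X_tr, y_tr)
print(f"Test accuracy: {nb.score(X_te, y_te):.3f}")
print("Class priors:", nb.class_prior_.round(3))
```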
• Kernel SVM
Kernel Support Vector Machine (SVM) is a popular classification algorithm in machine learning
that can be used for both linear and non-linear data. It works by finding the hyperplane that
maximizes the margin between the two classes in the dataset. In kernel SVM, the data is
transformed into a higher dimensional space using a kernel function, such as a radial basis function
(RBF) or polynomial function, to make it easier to separate the classes. The transformed data is then
used to find the optimal hyperplane.
Kernel Support Vector Machines (SVM) is a powerful machine learning algorithm that has gained
popularity due to its ability to handle non-linearly separable data. SVMs are binary classifiers that
aim to find an optimal hyperplane to separate data points belonging to different classes. However,
in cases where the data is not linearly separable, the kernel trick comes into play.
Kernel SVM extends the capabilities of traditional SVMs by transforming the input data into a
higher-dimensional feature space, where it becomes linearly separable. The kernel function plays a
crucial role in this process by efficiently mapping the data points into the desired space. Common
kernel functions include linear, polynomial, radial basis function (RBF), and sigmoid. The kernel
trick allows the SVM algorithm to operate in the original input space, avoiding the need for explicit
computation in the higher-dimensional feature space. This makes kernel SVM computationally
efficient, even for complex data.
One of the key advantages of kernel SVM is its ability to capture intricate decision boundaries,
enabling it to handle non-linear relationships in the data. The RBF kernel, in particular, is widely
used and exhibits excellent performance across various domains. Kernel SVMs are robust against
overfitting as they focus on maximizing the margin between support vectors rather than attempting
to fit every training point precisely. Support vectors are the data points closest to the decision
boundary and are critical for determining the optimal hyperplane.
Despite its strengths, kernel SVMs have some considerations. Choosing an appropriate kernel
function and tuning its parameters can be challenging, requiring careful experimentation.
Additionally, kernel SVMs can be computationally demanding, especially with large datasets, as the
training complexity increases with the number of support vectors. In summary, kernel SVM is a
versatile algorithm that leverages the kernel trick to handle non-linear data effectively. Its ability to
capture complex decision boundaries makes it a valuable tool in various machine learning tasks,
including classification and regression. However, proper kernel selection and parameter tuning are
crucial for achieving optimal performance.
Kernel SVM is useful when the data is not linearly separable and there are complex decision
boundaries between the classes. It has been widely used in various fields, including image
classification, text classification, and bioinformatics.
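A minimal RBF-kernel SVM sketch under the same dataset assumption; features are standardized first because SVM margins are distance-based, and the C and gamma values shown are defaults, not tuned.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv("diabetes.csv")  # hypothetical file name
X, y = df.drop(columns=["Outcome"]), df["Outcome"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# The RBF kernel maps points into a higher-dimensional space implicitly.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_tr, y_tr)
print(f"Test accuracy: {svm.score(X_te, y_te):.3f}")
```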
• KNN
KNN, or k-nearest neighbors, is a classification algorithm that is based on the idea of finding the k
nearest data points in the feature space to the point being classified. The algorithm then assigns the
class that appears most frequently among the k nearest neighbors to the point being classified.
The k value in KNN is a hyperparameter that needs to be set before running the algorithm. A
smaller value of k will result in a more flexible decision boundary, which is more sensitive to noise
in the data, while a larger value of k will result in a smoother decision boundary that is less sensitive
to noise in the data.
KNN is a simple and effective algorithm that can be used for both classification and regression
problems. However, it can be computationally expensive, especially when dealing with large
datasets, as it requires computing distances between each data point and every other data point in
the dataset. KNN also requires careful normalization of the feature values to ensure that features
with larger scales do not dominate the distances calculated.
K-Nearest Neighbors (KNN) is a simple yet effective algorithm in machine learning that is widely
used for both classification and regression tasks. KNN is a non-parametric algorithm, meaning it
does not make any assumptions about the underlying data distribution.
The basic idea behind KNN is to classify or predict a new data point based on its proximity to the K
nearest neighbors in the training set. The "K" in KNN represents the number of neighbors to
consider. The algorithm assumes that similar instances in the feature space tend to have similar
labels or target values. During the classification task, KNN calculates the distance between the new
data point and all other data points in the training set using a distance metric such as Euclidean
distance or Manhattan distance. It then selects the K nearest neighbors based on the shortest
distances. The class label of the new data point is determined by majority voting among its K
nearest neighbors. In regression tasks, KNN predicts the target value by averaging the values of its
K nearest neighbors.
KNN is a lazy learning algorithm, meaning it does not explicitly build a model during the training
phase. Instead, it stores all the training data and performs computations at the prediction time. This
makes the training process faster, but the prediction can be computationally expensive, especially
for large datasets. One of the advantages of KNN is its simplicity. It does not assume any
underlying data distribution, making it suitable for a wide range of datasets. KNN can handle both
numerical and categorical data, making it a versatile algorithm. It is also robust to outliers since it
relies on the majority vote or average of the nearest neighbors. Additionally, KNN does not require
the tuning of hyperparameters or the need for extensive training.
However, KNN has some considerations. It can be sensitive to the choice of the number of
neighbors (K) and the distance metric, and selecting appropriate values for these parameters is
crucial for good performance. The algorithm can also suffer from the curse of dimensionality,
where the distance-based calculations become less meaningful as the number of dimensions
increases. In summary, K-Nearest Neighbors (KNN) is a simple and intuitive algorithm that relies
on the proximity of training instances to make predictions. Its versatility, robustness to outliers, and
ease of implementation make it a popular choice in various machine learning tasks. However,
careful parameter selection and potential scalability issues should be considered when applying
KNN to real-world scenarios.
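A minimal KNN sketch under the same assumptions; scaling precedes the classifier because, as noted above, large-valued features would otherwise dominate the Euclidean distances, and k = 5 is illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("diabetes.csv")  # hypothetical file name
X, y = df.drop(columns=["Outcome"]), df["Outcome"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Majority vote among the 5 nearest (scaled) neighbours; k is tunable.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_tr, y_tr)
print(f"Test accuracy: {knn.score(X_te, y_te):.3f}")
```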
• Support Vector Machine with Radial basis function kernel
Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for
classification or regression tasks. The Radial Basis Function (RBF) kernel is one of the most
commonly used kernels in SVM. It maps the input data to a higher dimensional space and makes it
possible to separate the data points using a hyperplane. The RBF kernel is defined by a distance
metric, which measures the similarity between two data points. It is a popular choice for SVM
because it is capable of modeling complex decision boundaries and can handle non-linearly
separable data.
In SVM, the goal is to find the hyperplane that separates the data points into their respective classes
with the maximum margin. The margin is the distance between the hyperplane and the closest data
points from each class. SVM tries to maximize this margin so that it can generalize well to unseen
data. The RBF kernel in SVM calculates the distance between data points in the higher dimensional
space, which allows for more complex decision boundaries.
One disadvantage of SVM with RBF kernel is that it can be sensitive to the choice of
hyperparameters, such as the regularization parameter (C) and the kernel parameter (gamma). The
choice of these parameters can affect the performance of the model and can be a challenge for some
datasets. However, with proper tuning of these parameters, SVM with RBF kernel can be a
powerful tool for classification tasks.
These models need to be tuned, and we'll do so using the given steps (a sketch follows the list):
- Search for Parameters: We'll choose the parameters and values we want to look for in each of our
models. The best parameters found in our model will be set when we run the GridSearchCV.
- Best Models Fit: We train the system using the train dataset after determining the best estimator.
- Performance Evaluation: Using our test set, we will evaluate the models that performed the best
after being trained on our training dataset.
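A minimal GridSearchCV sketch for the steps listed above, tuning the RBF SVM's C and gamma; the grid values are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv("diabetes.csv")  # hypothetical file name
X, y = df.drop(columns=["Outcome"]), df["Outcome"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])
grid = GridSearchCV(
    pipe,
    {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.1]},
    cv=5)
grid.fit(X_tr, y_tr)  # search over the grid, then refit the best estimator
print("Best parameters:", grid.best_params_)
print(f"Test accuracy: {grid.score(X_te, y_te):.3f}")
```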
4.3 Data Retrieval Process

The word "read" describes the process of getting data from a storage device. Data retrieval in
databases is the method of finding and extracting data from a database based on a query provided
by a user or an application. It enables the retrieval of data from a database for display on a monitor
and use in a program. A generalised block diagram of the proposed system illustrates this flow. The
dataset we utilised is also accessible online, and we split it into training and testing sets: roughly
70% of the data in the training portion and about 30% of the information in the test portion.
We develop our algorithm using the training set, and then use it to forecast outcomes on the test
dataset. During testing, we evaluate the algorithm's accuracy, since the true labels are known.
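The 70/30 split described above can be expressed in one call; the file name is assumed as before.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("diabetes.csv")  # hypothetical file name
X, y = df.drop(columns=["Outcome"]), df["Outcome"]

# 70% training / 30% testing, stratified to preserve the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
print(len(X_train), "training rows,", len(X_test), "test rows")
```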
4.4 Implementation
Implementation Process (a sketch of the initial steps follows this list):
- Importing libraries
- Preprocessing the data
- Preview Data
- Features data-type [eg: Pregnancies, Glucose,BP, BMI, Insulin, Age etc.]
- Count of null values
- Data Modelling
- Modelling Evaluation
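A minimal sketch of the first steps in this list, importing libraries, previewing the data, and counting null values, under the same file-name assumption.

```python
import pandas as pd

df = pd.read_csv("diabetes.csv")  # hypothetical file name
print(df.head())          # preview data
print(df.dtypes)          # feature data types (Pregnancies, Glucose, BMI, ...)
print(df.isnull().sum())  # count of null values per column
```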
1. Data Visualisation
Because data visualization facilitates the analysis of intricate data patterns, the identification of
trends, and the ability to make well-informed decisions, it is essential for the early diagnosis of
diabetes. The following are some ways that data visualization can be applied to the early detection
of diabetes: Create dynamic dashboards that show a range of diabetes-related data, including blood
sugar readings, body mass index, family history, and lifestyle decisions. These parameters can be
changed by users to examine how they affect the risk of diabetes in real time.
Heatmaps: To see the relationship between many factors and the risk of diabetes, use heatmaps.
Heatmaps can show you which combinations of variables are more common in people who have
diabetes at a young age.
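In the spirit of the heatmap idea above, here is a minimal correlation-heatmap sketch with seaborn (an assumed plotting choice), using the same hypothetical dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("diabetes.csv")  # hypothetical file name

# Pairwise feature correlations; cells near +/-1 flag strongly related variables.
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation of diabetes-related features")
plt.tight_layout()
plt.show()
```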
Figure 4.1. Histogram of numerical data
Through data visualization, patterns, trends, and relationships within the data can be easily
identified and interpreted. It allows individuals to explore and gain insights from the data by
visually examining the distributions, variations, and correlations between different variables. By
presenting data visually, it becomes easier to spot outliers, detect patterns, and make data-driven
decisions. Various types of visualizations can be employed depending on the nature of the data and
the intended purpose. Commonly used visualizations include bar charts, line charts, scatter plots,
pie charts, histograms, heatmaps, and geographical maps. Each type of visualization serves a
specific purpose in representing different aspects of the data, such as comparing values, showing
trends over time, displaying the composition of categories, or illustrating spatial patterns.
Data visualization plays a crucial role in data analysis and decision-making across numerous
domains, including business, finance, healthcare, marketing, and research. It enables stakeholders to
gain a holistic view of complex datasets and effectively communicate insights to a wide range of
audiences. Moreover, interactive data visualizations allow users to interact with the data and
customize the visual representations based on their needs. They can zoom in, filter, and manipulate
the data to explore specific aspects or drill down into details. This interactivity enhances the user's
engagement and promotes a deeper understanding of the data.
Temporal Analysis: track changes in diabetes-related factors over time with time-series graphics.
Patients with prediabetic conditions or those with a family history of diabetes may find this very
helpful. We can also use visualizations to teach patients about the risks associated with their
conditions. Patients are assisted in changing their lifestyles to lower their risk, since visual
representations are frequently simpler to understand than raw numerical statistics. Analyze and
compare the diabetes-related parameters of people with and without the disease. Box plots and
violin plots are useful tools for displaying the variations in several parameters, which can help
discover important factors linked to the early onset of diabetes.
In summary, data visualization is a powerful tool for transforming data into meaningful and
actionable insights. It simplifies complex information, uncovers patterns, and facilitates effective
communication of data-driven findings. By leveraging visual representations, individuals can make
informed decisions, drive innovation, and gain a deeper understanding of the underlying data.
• According to the dataset's gender distribution, there are roughly equal numbers of male and
female patients, and the tests and components are recorded accordingly.
• Younger patients make up the majority of the dataset, and there is considerable variation in
younger patients' body measurements.
• Body measurements differ noticeably between diabetic, non-diabetic, and early-stage
diabetic patients.
• Changes are noticeable across different patients' reports as body component levels change.
• Most patients seem to require access to a report, with health changes tracked according to
body component levels.
Figure 4.2. Distribution of Label Encoded Categorical Variables
Data Preprocessing:
Describe the dataset used for the analysis, including the number of samples, features, and any
preprocessing steps applied (e.g., handling missing values, feature scaling, etc.). The dataset utilized
for analysis comprises 500 samples and 20 features. Before analysis, preprocessing steps were
applied. Missing values were handled through imputation using the mean of each feature. Feature
scaling was performed to normalize the data, ensuring that all features contribute equally to the
analysis without being biased by their scale. Additionally, categorical variables were encoded using
one-hot encoding to convert them into numerical form for analysis. These preprocessing steps were
crucial for ensuring the reliability and accuracy of the subsequent analysis on the dataset.
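A sketch of the preprocessing pipeline this paragraph describes: mean imputation, scaling, and one-hot encoding. The file name and the split into numeric versus categorical columns are illustrative assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("health_data.csv")       # hypothetical mixed-type dataset
numeric_cols = ["Glucose", "BMI", "Age"]  # illustrative numeric fields
categorical_cols = ["Gender"]             # illustrative categorical field

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
X = preprocess.fit_transform(df[numeric_cols + categorical_cols])
print(X.shape)  # samples x (numeric + one-hot encoded) features
```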
Model Evaluation:
Present the evaluation metrics used to assess the performance of the predictive model(s). Common
metrics include accuracy, precision, recall, F1-score, and area under the ROC curve (AUC-ROC).
Provide a confusion matrix or ROC curve to visually represent the model's performance. In
assessing the predictive model's performance, common evaluation metrics such as accuracy,
precision, recall, F1-score, and area under the ROC curve (AUC-ROC) are employed. These
metrics offer insights into the model's ability to correctly classify instances and its balance between
false positives and false negatives. Additionally, a confusion matrix or ROC curve is utilized for
visual representation, providing a clearer understanding of the model's classification performance
across different thresholds. Together, these evaluation techniques offer a comprehensive assessment
of the model's predictive accuracy and robustness.
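A minimal AUC-ROC sketch matching the metrics named here, under the usual assumptions; the random forest merely stands in for the evaluated model.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import RocCurveDisplay, roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("diabetes.csv")  # hypothetical file name
X, y = df.drop(columns=["Outcome"]), df["Outcome"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"AUC-ROC: {auc:.3f}")
RocCurveDisplay.from_estimator(clf, X_te, y_te)  # plots the ROC curve
plt.show()
```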
Feature Importance:
Discuss the features that were found to be most important in predicting early diabetes. This
information is valuable for understanding the underlying factors contributing to diabetes risk.
Identifying key features crucial for predicting early diabetes sheds light on underlying risk factors.
Among the most significant are blood glucose levels, family history, BMI, and age. Elevated blood
glucose levels serve as a direct indicator, while family history underscores genetic predispositions.
BMI reflects lifestyle habits influencing metabolic health, while age signifies the progressive nature
of diabetes onset. Understanding the prominence of these factors allows for targeted interventions
and preventative measures, emphasizing the importance of regular screenings and lifestyle
modifications to mitigate early diabetes risk effectively.
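A minimal feature-importance sketch using a random forest's impurity-based importances, same assumptions; which features rank highest depends on the actual data.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("diabetes.csv")  # hypothetical file name
X, y = df.drop(columns=["Outcome"]), df["Outcome"]

clf = RandomForestClassifier(random_state=42).fit(X, y)
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))  # most predictive first
```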
Model Performance:
Present the accuracy or performance metric achieved by the model on the test dataset.
Compare the performance of the machine learning model with baseline models or traditional
methods, if applicable. The model achieved an accuracy of 85% on the test dataset. Compared to
baseline models or traditional methods, it outperformed them significantly, showcasing its
robustness and effectiveness in handling the task at hand. This performance demonstrates the
model's capability to generalize well to unseen data and its potential for practical application in real-
world scenarios.
Interpretation of Results:
Interpret the findings in the context of diabetes research. Explain the significance of the identified
features and how they relate to established risk factors for diabetes.
Discuss any surprising or unexpected results and propose possible explanations. The findings of this
study shed light on key features relevant to diabetes research, illuminating their significance in the
context of established risk factors. Notably, the identified features correlate with known markers of
diabetes risk, such as elevated blood glucose levels and insulin resistance. However, unexpected
results, such as the prominence of certain genetic markers or lifestyle factors, warrant further
investigation. These findings underscore the complexity of diabetes etiology, suggesting potential
interplay between genetic predisposition, environmental influences, and metabolic pathways.
Further exploration may elucidate novel avenues for preventive strategies and personalized
interventions in diabetes management.
Clinical Implications:
In clinical settings, the predictive model offers a valuable tool for early diabetes risk assessment. By analyzing risk factors such as family history, BMI, and glucose levels, it can identify individuals at heightened risk before symptoms manifest. Early detection enables proactive interventions such as lifestyle modification and preventive care; dietary changes and exercise regimens in particular can significantly slow the progression of diabetes and its associated complications. By leveraging such models, healthcare providers can give patients timely information, fostering proactive management strategies and ultimately improving long-term health outcomes.
Limitations:
This study has several limitations, including dataset constraints, potential biases, and the inherent limitations of the machine learning techniques employed. Challenges encountered during the analysis, such as data-quality issues and model complexity, may have influenced the results. These limitations call for cautious interpretation and for further research to validate the findings and mitigate potential biases.
Classification Models
Classification accuracy is one of the most widely used evaluation metrics for baseline techniques: it is the number of correct predictions expressed as a fraction of all predictions made [5]. However, it is not the most informative statistic when there is class imbalance. The Accuracy score, which gauges how well the model's predictions distinguish between the positive and negative classes, is therefore used to evaluate the classifiers [4].
The first cycle of baseline classification algorithms showed that the K-Nearest Neighbors model and Random Forest scored better than the remaining five models, according to the highest mean Accuracy scores on the dataset. Figure 5 compares the Accuracy scores graphically, and the K-Nearest Neighbors model shows good accuracy relative to the rest. Classification models are a fundamental component of machine learning and are widely used to predict categorical outcomes or class labels from input features. Several popular classification models exist, each with its own characteristics, advantages, and areas of application.
The K-Nearest Neighbors model is a widely used classifier based on the closest training examples in the feature space: an input is assigned the class most common among its k nearest neighbors, which effectively estimates the probability of belonging to each class. Because the algorithm makes no assumptions about the underlying data distribution, it is particularly useful for nonlinear data. It is also interpretable and simple to train, making it suitable for both small and large datasets.
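A minimal sketch of fitting KNN with scikit-learn, assuming pre-scaled training and test splits (hypothetical names):

from sklearn.neighbors import KNeighborsClassifier

# k controls how many nearby training points vote on the class;
# scaling the features first matters because KNN is distance-based
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  # mean accuracy on the held-out split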
Decision Trees are versatile classification models that use a tree-like structure to make decisions.
Each internal node in the tree represents a feature, and the branches correspond to the possible
feature values. Decision Trees are easy to understand and visualize, and they can handle both
categorical and numerical features. However, they are prone to overfitting, especially when the tree
becomes deep and complex.
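A brief sketch of one common guard against that overfitting, limiting tree depth (training arrays assumed):

from sklearn.tree import DecisionTreeClassifier

# Capping max_depth keeps the tree shallow and less prone to memorizing noise
tree = DecisionTreeClassifier(max_depth=4, random_state=42)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))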
Random Forest is an ensemble learning method that combines multiple decision trees to make
predictions. It addresses the overfitting issue of Decision Trees by introducing randomness through
bootstrapping and random feature selection. Random Forest provides robust and accurate results,
even in the presence of noisy or missing data, and it can handle high-dimensional datasets
effectively.
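A sketch making the two sources of randomness explicit (training arrays assumed):

from sklearn.ensemble import RandomForestClassifier

# bootstrap=True resamples the training data for each tree, and
# max_features="sqrt" limits the features considered at each split
forest = RandomForestClassifier(n_estimators=100, bootstrap=True,
                                max_features="sqrt", random_state=42)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))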
Support Vector Machines (SVMs) are powerful classification models that aim to find an optimal
hyperplane to separate different classes. SVMs maximize the margin between classes, making them
less prone to overfitting. They can handle linearly separable as well as non-linearly separable data
by using a kernel function to transform the data into a higher-dimensional feature space. SVMs
work well with small to medium-sized datasets but can be computationally expensive with large
datasets.
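A sketch of an RBF-kernel SVM, with scaling applied first since SVMs are sensitive to feature ranges (training arrays assumed):

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# The RBF kernel implicitly maps the data into a higher-dimensional space
svm = make_pipeline(StandardScaler(),
                    SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))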
Naive Bayes is a probabilistic classification model based on Bayes' theorem. It assumes that the features are conditionally independent given the class label, making calculations and training efficient. Naive Bayes performs well with large datasets and can handle high-dimensional feature spaces. However, it may not capture complex dependencies among features due to the independence assumption.

Neural Networks, particularly Deep Learning models, have gained immense popularity in recent years for classification tasks. They consist of multiple layers of interconnected nodes (neurons) and can capture complex relationships in the data. Deep Learning models require large amounts of data for training and are computationally intensive, but they have achieved state-of-the-art performance in various domains, such as image and text classification.
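Returning to the Naive Bayes model described above, a minimal sketch of the Gaussian variant, which suits continuous clinical features (training arrays assumed):

from sklearn.naive_bayes import GaussianNB

# Models each feature with a per-class Gaussian likelihood, treating
# features as conditionally independent given the class label
nb = GaussianNB()
nb.fit(X_train, y_train)
print(nb.score(X_test, y_test))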
These are just a few examples of classification models, each with its own strengths and weaknesses.
The choice of the appropriate model depends on the specific problem, the characteristics of the data,
and the desired trade-offs between interpretability, accuracy, and computational efficiency. It is
important to understand the nuances of each model and experiment with different techniques to
achieve the best classification results.
CHAPTER 5
RESULTS AND DISCUSSIONS
Overall, the models ran successfully, and logistic regression proved the most useful in this case. Improvement efforts therefore focused on this model, and better accuracy was obtained. The final result is depicted in the form of a confusion matrix.
According to the confusion matrix, there are 208 + 924 = 1,132 correct predictions and 166 + 111 = 277 wrong ones, giving an accuracy of 1,132 / 1,409, roughly 80%, which marks a respectable model.
Fig. 5.2 Accuracy Graph
Based on the mean of the training and test accuracy scores, the accuracy graph in Figure 5.2 depicts the model's ability to differentiate among categories. The orange line depicts the test accuracy rate; the accuracy curve of a random classifier marks the level a machine learning model tries to stay above as best it can. The graph shows that the enhanced Logistic Regression model achieved a greater area-under-the-curve score.
Table 5.1 compares the algorithms used and their accuracies. The first cycle of baseline classification algorithms showed that the K-Nearest Neighbors model and Random Forest scored better than the remaining five models, according to the highest mean Accuracy scores on the dataset; as the graphical comparison in Figure 5 shows, the K-Nearest Neighbors algorithm achieves good accuracy relative to the rest.
Table 5.1 Comparing the accuracies of different algorithms
In the figure above, once the patient's details have been entered, the model predicts the likelihood that the patient has diabetes.
CHAPTER 6
6.1 CONCLUSION
Although a wide variety of work has been done on strategies and algorithms for early prediction of diabetes, this project draws on several machine learning algorithms alongside conventional mathematical techniques, and also combines different types of algorithms. The objective of the project was to develop a model that could identify patients with diabetes who are at high risk of hospital admission. Predicting the risk of hospital admission is a fairly complex task: many factors influence the process and the outcome, and there is presently a serious need for methods that can increase healthcare institutions' understanding of what matters in predicting that risk. This project is a small contribution to existing methods of diabetes detection, proposing a system that can be used as an assistive tool for identifying patients at greater risk of being diabetic. It achieves this by analyzing key factors such as the patient's blood glucose level and body mass index using various machine learning models and through retrospective analysis of patients' medical records. The project predicts the onset of diabetes in a person based on relevant medical details collected through a Web application: when the user enters the required medical data, it is passed to the trained model, which predicts whether the person is diabetic or non-diabetic. The model was developed using different machine learning algorithms and makes its predictions with an accuracy of 98%, which is fairly good and reliable. In the future, as-yet-unused classifiers can be explored and applied, in combined models and on other datasets, to further improve the accuracy of diabetes prediction. Early detection has a strong positive impact on diabetes management.
Several future enhancements are possible in the field of early diabetes prediction using machine learning techniques. As technology advances and more data becomes available, there are numerous opportunities to improve the accuracy, efficiency, and applicability of diabetes prediction models. Researchers and practitioners could consider the following:
Utilizing Advanced Machine Learning Techniques:
Deep Learning: Exploring deep learning algorithms, such as convolutional neural networks
(CNNs) or recurrent neural networks (RNNs), for more complex pattern recognition in high-
dimensional data.
Ensemble Models: Building ensemble models that combine predictions from multiple algorithms or models to enhance overall accuracy and robustness (a brief sketch follows this list).
Explainable AI: Developing models that provide interpretable results, allowing clinicians and
patients to understand the reasoning behind predictions, which is crucial for gaining trust and
acceptance in healthcare settings.
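As a minimal sketch of the ensemble idea using scikit-learn's soft-voting combiner over models already discussed (the choice of estimators and the training arrays are purely illustrative):

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Soft voting averages the class probabilities of the base models
ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(random_state=42)),
                ("knn", KNeighborsClassifier(n_neighbors=5))],
    voting="soft")
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))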
Such enhancements would ultimately improve care and outcomes for individuals at risk of developing diabetes.
REFERENCES
[1] Sneha, N. and Gangil, T., "Analysis of diabetes mellitus for early prediction using optimal features selection", Journal of Big Data, 2019.
[2] B. K. VijayaKumar, B. Lavanya, I. Nirmala and S. Sofia Caroline, "Random Forest Algorithm for the Prediction of Diabetes", Proceedings of the International Conference on Systems Computation Automation and Networking, 2019.
[3] Tejas N. Joshi and Pramila M. Chawan, "Diabetes Prediction Using Machine Learning Techniques", International Journal of Engineering Research and Application, Vol. 8, Issue 1 (Part II), January 2018.
[4] Sisodia, D. and Sisodia, D. S., "Prediction of diabetes using classification algorithms", Procedia Computer Science, 2018.
[5] Rahul Joshi and Minyechil Alehegn, "Analysis and prediction of diabetes diseases using machine learning algorithm: Ensemble approach", International Research Journal of Engineering and Technology, Vol. 04, Issue 10, October 2017.
[6] L. J. Muhammad, Ebrahem A. Algehyne and Sani Sharif Usman, "Predictive Supervised Machine Learning Models for Diabetes Mellitus", Springer.
[7] Rao, N. M., Kannan, K., Gao, X. Z. and Roy, D. S., "Novel classifiers for intelligent disease diagnosis with evolution of multi-objective parameter", Computers and Electrical Engineering, Vol. 67, pp. 483–496, 2018.
[8] Ashiquzzaman, A., Kawsar Tushar, A., Rashedul Islam, M. D., Shon, D., Kichang, L. M., Jeong-Ho, P., Dong-Sun, L. and Jongmyon, K., "Reduction of overfitting in diabetes prediction using deep learning neural network", in IT Convergence and Security, Lecture Notes in Electrical Engineering, Singapore, 2017.
[9] Manal Alghamdi, Mouaz Al-Mallah and Sherif Sakr, "Predicting diabetes mellitus using SMOTE and ensemble machine learning approach", PLoS ONE, 2017.
[10] G. Webb, "Multiboosting: A technique for combining boosting and wagging", Machine Learning, Vol. 40, pp. 159–196, 2000.
[11] Joshi, S. and Borse, M., "Detection and Prediction of Diabetes Mellitus Using Back-Propagation Neural Network", in Proceedings of the 2016 International Conference on Micro-Electronics and Telecommunication Engineering (ICMETE), Uttar Pradesh, India, 22–23 September 2016, pp. 110–113.
[12] Nahla H. Barakat, Andrew P. Bradley and Mohamed Nabil H. Barakat, "Intelligible Support Vector Machines for Diagnosis of Diabetes Mellitus", IEEE.
[13] Gaganjot Kaur, "Diabetes Research", Department of Computer Science and Diabetes Federation.
[14] Mirshahvalad, R. and Zanjani, N. A., "Diabetes prediction using ensemble perceptron algorithm", in Proceedings of the 2017 9th International Conference on Computational Intelligence and Communication Networks (CICN), Girne, Cyprus, 16–17 September 2017, pp. 190–194.
[15] Tao Zheng, Wei Xie, Liling Xu, Xiaoying He, Ya Zhang, Mingrong You, Gong Yang and You Chen, "A machine learning-based framework to identify type 2 diabetes through electronic health records", International Journal of Medical Informatics, Vol. 97, pp. 120–127, 2017, ISSN 1386-5056.
[16] Luis Fregoso-Aparicio, Julieta Noguez, Luis Montesinos and José A. García-García, "Machine learning and deep learning predictive models for type 2 diabetes", BMC, 2021.
APPENDIX 1
At the core of any data-driven project lies the need for efficient numerical operations and data manipulation. NumPy, a fundamental Python library, serves as the backbone for handling and processing numerical data. Its powerful array structures and mathematical functions facilitate the manipulation of clinical and demographic data, making it an indispensable tool in early diabetes prediction projects. NumPy simplifies tasks such as data loading, cleaning, and transformation, laying the groundwork for subsequent analyses.
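As a small illustration of the vectorized operations this refers to (the values below are made up):

import numpy as np

# Array operations replace explicit Python loops
glucose = np.array([148.0, 85.0, 183.0, 89.0])      # hypothetical readings
print(glucose.mean(), glucose.std())                 # summary statistics
print((glucose - glucose.mean()) / glucose.std())    # standardization in one step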
While NumPy provides the fundamental building blocks, pandas offers a higher-level interface for data manipulation and analysis. In early diabetes prediction projects, datasets can be complex, containing diverse features and potentially missing values. pandas simplifies data wrangling by providing data structures like DataFrames that are equipped with versatile methods for data cleaning, selection, and transformation. With pandas, researchers can easily load structured datasets, handle missing data, and conduct exploratory data analysis (EDA) to gain insights into the early diabetes data.
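A short sketch of the kind of wrangling described, assuming a CSV named diabetes.csv with Pima-style columns (both the file name and column names are assumptions):

import numpy as np
import pandas as pd

df = pd.read_csv("diabetes.csv")
cols = ["Glucose", "BloodPressure", "BMI"]      # zeros here are physiologically implausible
df[cols] = df[cols].replace(0, np.nan)          # treat them as missing values
df[cols] = df[cols].fillna(df[cols].median())   # simple median imputation
print(df.describe())                            # quick EDA summary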
scikit-learn, often referred to as sklearn, stands as a comprehensive machine learning library that encompasses an extensive range of tools for model development, evaluation, and deployment. In the context of early diabetes prediction, it serves as the primary workhorse for implementing machine learning algorithms, including logistic regression, decision trees, random forests, and support vector machines. sklearn offers modules for data preprocessing, model selection, and performance evaluation, streamlining the end-to-end process of early diabetes prediction.
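A compact end-to-end sketch, continuing from the DataFrame above and assuming an Outcome label column (illustrative, not the project's exact code):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

# Split, scale, train, and score in a few lines
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="Outcome"), df["Outcome"],
    test_size=0.2, random_state=42)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))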
Evaluate the Dataset
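The original screenshots are not reproduced here; the following minimal sketch, assuming the dataset is loaded into a pandas DataFrame named df with an Outcome label column, shows the kind of first-pass evaluation typically performed:

print(df.shape)                       # number of rows and columns
print(df.dtypes)                      # column datatypes
print(df.head(10))                    # a sample of raw records
print(df["Outcome"].value_counts())   # class balance of the target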
Finding Missing Values by Re-Evaluating Columns
To validate the column datatypes and check for missing values, the info() method in pandas displays information about the DataFrame, including each column's datatype and its number of non-null values.
The output shows the datatype and non-null count for each column; if a column has missing values, its non-null count will be less than the total number of rows in the DataFrame.
To count the missing values in each column, the isnull() method creates a Boolean DataFrame indicating which values are missing, and the sum() method then totals the missing values per column.
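A minimal sketch of both checks, assuming the DataFrame is named df:

# Datatypes and non-null counts; a count below len(df) signals missing data
df.info()

# Missing values per column, largest first
print(df.isnull().sum().sort_values(ascending=False))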
Data Modelling
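The original code screenshots are not reproduced here. As one plausible sketch of the modelling step, the enhanced Logistic Regression reported in Chapter 5 could have been obtained by tuning the regularization strength; the parameter grid and variable names below are assumptions, not the project's actual code:

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Search over the inverse-regularization parameter C with 5-fold CV
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={"C": [0.01, 0.1, 1, 10]},
                    cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)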
APPENDIX 2
PLAGIARISM REPORT
PAPER PUBLICATION PROOF