Heart Disease Prediction Using Binary Classification
CSUSB ScholarWorks
5-2023
Part of the Computer Sciences Commons, and the Data Science Commons
Recommended Citation
Devare, Virendra Sunil, "Heart Disease Prediction Using Binary Classification" (2023). Electronic Theses,
Projects, and Dissertations. 1747.
https://scholarworks.lib.csusb.edu/etd/1747
HEART DISEASE PREDICTION USING BINARY CLASSIFICATION
A Project
Presented to the
Faculty of
California State University,
San Bernardino
In Partial Fulfillment
of the Requirements for the Degree
Master of Science
in
Computer Science
by
Virendra Sunil Devare
May 2023
HEART DISEASE PREDICTION USING BINARY CLASSIFICATION
A Project
Presented to the
Faculty of
California State University,
San Bernardino
By
Virendra Sunil Devare
May 2023
Approved by:
ABSTRACT
In this project, I built a neural network model to predict heart disease using a
binary classification technique on a patient information dataset from the UCI Machine
Learning Repository, and performed feature extraction. The results show that the
model, after hyperparameter tuning, achieved an accuracy of 94.98% and a 0.947 area
under the curve in ROC curve analysis. In addition, to identify the most important
factors in heart disease prediction, I also performed a feature importance analysis.
This analysis showed that factors such as the type of chest pain, peak heart rate, and
exercise-induced ST-segment depression were among the strongest predictors of heart
disease. Overall, the project provided insights into heart disease classification. The
model developed can serve as an aid for early detection, although further work is
needed to confirm the model's performance in larger and more diverse patient
populations.
ACKNOWLEDGEMENTS
I would like to thank Dr. Jennifer Jin, the chair of my committee, for her unwavering
support and her constant push to make this study the best it can be. I also want
to thank the members of my committee, Dr. Amir Ghasemkhani and Dr. Khalil
Dajani (Chair of Department), for accepting my invitation to join the committee and
for putting their faith in me to complete this academic endeavor. California State
University, San Bernardino has provided a curriculum that will aid me in achieving
my career goals. I owe a debt of gratitude to my father for paving the way for me to
pursue a master's degree.
TABLE OF CONTENTS
4.2 Splitting the Dataset into Training and Testing Sets .................... 17
5.2 Evaluating the Performance of The Model on The Test Set ....... 24
5.3 Comparing the Performance of Binary Model with Hyperparameter Tuning ......... 26
CHAPTER SIX: CONCLUSION AND FUTURE WORK ............................... 30
REFERENCES ............................................................................................ 32
LIST OF FIGURES
CHAPTER ONE
INTRODUCTION
Heart disease is a major public health threat and causes millions of deaths
worldwide. Cardiovascular disease (CVD) refers to various heart and blood vessel-
related disorders, such as coronary artery disease, heart failure, stroke, and
peripheral artery disease. According to the World Health Organization (WHO), 17.9
million people die each year from CVD-related causes - accounting for 31% of all
global fatalities.
Traditional diagnostic methods may not always be accurate, as some people may not
exhibit symptoms until late in the disease's progression. Machine learning and
artificial intelligence techniques have the potential to aid in early detection and
diagnosis of heart disease. This project aims to develop an accurate and
efficient predictive model for heart disease. The objective is to create a tool that
can assist clinicians in detecting the disease early and improving patient
outcomes. Ultimately, this endeavor strives to contribute to advancements in the
prevention and early diagnosis of heart disease.
This project utilized the Heart Disease UCI dataset, which contains 14
attributes related to heart disease. This includes demographic information like age
and sex, as well as clinical measurements such as cholesterol levels and maximum
heart rate during exercise. The dataset contains 303 instances, each representing a
patient who has undergone diagnostic testing for heart disease, and the target
attribute classifies patients as having heart disease or not. This binary classification
problem requires the model to recognize patterns and relationships within the dataset
in order to make predictions on new, unseen data. This project develops a neural
network model to predict heart disease using the Heart Disease UCI dataset. Its
accuracy and efficiency will be assessed using standard performance metrics, with
the aim of creating an aid that can facilitate early detection and diagnosis of heart
disease, leading to improved patient outcomes.
CHAPTER TWO
LITERATURE REVIEW
Heart disease is one of the leading causes of death worldwide and affects
millions of people. Early detection and treatment of the condition can help avoid
serious complications and improve patient outcomes. With the growing availability of
medical big data sets, researchers have explored various approaches for predicting
heart disease with machine learning. In this review we will highlight some recent
research studies utilizing machine learning techniques. One study developed a
neural network model to predict patients' risk for heart disease based on their
electronic health records; with an accuracy rate of 85%, this research demonstrated
the feasibility of applying machine learning to heart disease risk prediction.
Recently, Zhan, X. (2021) [2] used a deep learning algorithm to accurately
predict the risk of heart disease using demographic, clinical, and genetic data.
They employed 46,860 individuals in their study and achieved an accuracy rate of
78%. Furthermore, this work demonstrated how important feature selection can be
in improving model performance. Other researchers have developed machine learning
algorithms for predicting specific types of heart disease such as coronary artery
disease (CAD) and atrial fibrillation (AF). Li et al. (2020) [7] created a machine
learning model to predict CAD risk using clinical and demographic data; their
accuracy rate was 86%, showing its potential in this regard. Lee et al. (2019) [8]
utilized electrocardiogram (ECG) [7] data and machine learning algorithms to predict
AF. However, several challenges still need to be overcome, such as the need for large
and diverse datasets, feature selection, and model interpretability.
In conclusion, heart disease is a major public health concern, and early
detection is essential. Machine learning techniques show promise for predicting the
risk of heart disease using various types of data. While the studies reviewed report
encouraging results, they also highlight the challenges associated with creating
accurate models for this purpose.
CHAPTER THREE
3.1 Data Collection
The workflow of the heart disease prediction system begins with collecting data for heart
disease prediction. To this end, I have utilized an open dataset from the UCI Machine Learning
Repository which contains 303 instances of patients who have undergone cardiac [6]
evaluations and includes 14 attributes such as age, sex, chest pain type, resting blood
pressure, and cholesterol. When using an open dataset, it is important to verify its source and
authenticity. In this instance, the dataset has been widely used in numerous heart disease
prediction studies and cited in several peer-reviewed publications, indicating its reliability. I
also performed an initial exploration of the data to gain an insight into its distribution, range,
and any outliers. Doing this helps identify any potential issues with the data and guides
decisions regarding data cleaning and preprocessing.
Figure 1 Explanation of Dataset (Latha & Jeeva, 2019) [21]
Another important consideration is the privacy of the patient data being collected during
this process. In this instance, the dataset has been de-identified, meaning personal identifying
information has been removed to safeguard patient privacy.
Overall, data collection is essential for any machine learning project because it lays
the groundwork for subsequent steps like data cleaning, feature selection and model
training. By using a publicly accessible dataset from an authoritative source, I have ensured
its accuracy and validity - essential when predicting heart disease with accuracy and
reliability.
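To make this step concrete, the sketch below loads and inspects the dataset with pandas. It is a minimal sketch, assuming a headerless CSV file named heart.csv with the conventional UCI column names; the actual file name and loading code used in the project are not given in the report.

```python
# Minimal sketch of the data collection / inspection step (assumed file and columns).
import pandas as pd

# The processed UCI heart data is commonly distributed as a headerless CSV with
# 13 feature columns and a target column, using "?" for missing values.
columns = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target",
]

df = pd.read_csv("heart.csv", names=columns, na_values="?")  # hypothetical path

print(df.shape)          # expected: (303, 14)
print(df.describe())     # distribution and range of each attribute
print(df.isna().sum())   # missing values per column
```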
3.2 Data Cleaning
Data cleaning [2] is a critical step in any Machine Learning project, as it guarantees
the data is accurate and trustworthy for modeling and analysis. When it comes to heart
disease prediction, data cleaning involves detecting and correcting any errors or
inconsistencies in the dataset which could affect its predictive model's accuracy.
This project's data cleaning process involved several steps. The initial step involved
identifying and handling any missing data points in the dataset, so that no bias could be
introduced into the analysis. Missing values were either replaced with an appropriate value,
such as the mean or median of the attribute, or the affected records were removed.
The next step in data cleaning was to check for duplicates in the dataset. This step
ensured that each observation was unique and there were no repeating data points which
could distort analysis. Any duplicate observations were removed from the dataset.
The third step in data cleaning [5] was to identify and eliminate any outliers from
the dataset. Outliers are data points that lie far outside of most other data, which can
significantly impact model accuracy. In this project, outliers were identified using a box
plot analysis and removed from the dataset.
Finally, the data was standardized to guarantee all attributes were on the same scale.
This was done by subtracting the mean and dividing by the standard deviation for each
attribute. Standardization helps guarantee that no single attribute has more influence over the
model than the others.
Overall, the data cleaning process was crucial in ensuring that the dataset was
accurate and reliable for use in the heart disease prediction model.
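The cleaning steps described above can be summarized in a short sketch. The median imputation, the IQR rule used for the box-plot outliers, and the column name target are assumptions for illustration; the project's exact thresholds are not given in the text.

```python
# Sketch of the cleaning pipeline described in this section (details assumed).
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # 1. Handle missing values: fill numeric gaps with the column median.
    df = df.fillna(df.median(numeric_only=True))

    # 2. Drop duplicate observations so every record is unique.
    df = df.drop_duplicates()

    # 3. Remove outliers with the box-plot (IQR) rule on the numeric features.
    numeric = df.select_dtypes("number").drop(columns=["target"], errors="ignore")
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    keep = ~((numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)).any(axis=1)
    df = df[keep].copy()

    # 4. Standardize: subtract the mean and divide by the standard deviation.
    features = df.drop(columns=["target"])  # "target" is an assumed label column
    df[features.columns] = (features - features.mean()) / features.std()
    return df
```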
3.3 Training Data
In the heart disease prediction system project, training data refers to the subset of the
cleaned dataset used for training the machine learning model. This training data [5] is
randomly selected from within the cleaned dataset, with 70% of observations being used
for training.
Selecting training data is a critical step in the machine learning process, as it directly
influences the model's performance. If the training data does not represent the population
being studied, the model may not perform well on new data. Therefore, it is important that
the training data is representative of the overall population.
In this project, I randomly selected 70% of the cleaned dataset to train my model
with. Doing so helps minimize bias and ensures that the training data is representative
of the patient population. By employing random selection, I can help guarantee that the
model is not biased towards any particular subset of the data.
Once the training data is selected, it can be used to train a machine learning model.
During this step, the model is adjusted to fit the training data by minimizing the error
between the predicted output and the actual output. The process continues until either the
error reaches an acceptable level or a set number of training iterations is completed.
Overall, selecting training data is an essential step in machine learning, and I have
taken great care to guarantee it represents the population by using random selection.
3.4 Testing Data
Testing data is used to evaluate how well the trained model performs on unseen data.
In this section of the project, we will discuss the testing data used to evaluate the heart
disease prediction model. After cleaning the dataset, we randomly split it into two parts: 70%
for training and 30% for testing. Testing data was kept separate from training data
throughout model construction [14] and training to guarantee that the evaluation gives an
unbiased estimate of the model's generalization performance.
The testing dataset consisted of 91 patient records, each containing the
same 14 features as in the training dataset. These included age, sex, blood
pressure, cholesterol, and the other clinical attributes, and labels were provided
indicating whether a patient had heart disease or not.
Once trained on the training dataset [5], the model was applied to the testing
dataset to produce a prediction for each instance. These predictions were then
compared with the actual labels to measure the model's performance.
It is essential to note that the testing dataset was kept separate from the
training dataset, and no information from it was used when training or tuning the
model, so that the evaluation reflects performance on new, unseen data.
3.5 Model Construction
In this project, the dataset contains both categorical and numerical features.
A categorical feature takes one of a limited number of categories, such as the type of
chest pain, while a numeric feature represents a sequence of numerical values, such as
age or cholesterol level. One-hot encoding was used to convert each categorical variable
into a binary vector defining the category to which the observation belongs.
However, because the goal is to predict whether a patient will acquire heart
disease, the model's output is binary. The model's binary output should show
whether the patient is most likely suffering from cardiac illness or not, rather than
performing multi-class identification.
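The one-hot encoding mentioned above can be illustrated with pandas' get_dummies; the column name cp for the chest-pain type is an assumed, conventional name from the UCI dataset rather than a name quoted from the project code.

```python
# Sketch: converting a categorical feature into binary (one-hot) vectors.
import pandas as pd

df = pd.DataFrame({"cp": [0, 2, 1, 3, 2]})  # chest-pain type codes (assumed column name)

# Each category becomes its own 0/1 column, e.g. cp_0 ... cp_3.
encoded = pd.get_dummies(df, columns=["cp"], prefix="cp")
print(encoded)
```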
For example, when it comes to forecasting heart illness, categorical models can be
contrasted with binary classification. Binary classification also uses a two-valued output
variable. Because there are just two possible values, this approach can be applied to a wide
range of two-class problems, including the task of predicting cardiac disease. The output
variable can have two values: 0 for no heart disease and 1 for the presence of heart disease.
The following chapters give a detailed description of the heart disease prediction system,
with details of the models used.
CHAPTER FOUR
MODEL DEVELOPMENT
In this chapter, we will cover the development of the neural network model
for heart disease classification. The chapter is divided into four parts: an overview
of neural networks and their architecture, splitting the dataset into training and
testing sets, building and training the neural network model, and hyperparameter
tuning.
4.1 Overview of Neural Networks and Their Architecture
Neural networks are machine learning models designed to mimic the operations of
the human brain. They are made up of layers of interconnected neurons. In this project, the
dataset has been split into training and testing data, and the training data is passed to a
neural network with two hidden layers.
Figure 2 Architecture of Model [23]
In a feedforward neural network, data flows in one direction: from the input layer to
the output layer. The input layer receives the input data before passing it along to the hidden
layers, which process it before passing the result to the output layer.
The number of hidden layers and neurons within each layer can vary depending on
the complexity of the problem. Deep neural networks, with many hidden layers, are
commonly employed for difficult issues like image or speech recognition; on the other hand,
shallow neural networks with fewer hidden layers are often sufficient for simpler tasks.
The connections between neurons carry weights that are learned during training. The goal
of training a neural network is to adjust these weights to minimize the error between the
predicted output and the actual output; this is typically done with optimization algorithms
such as gradient descent.
Neural networks have proven effective for many machine learning problems. Their
capacity for absorbing large amounts of information and handling complex relationships
between inputs makes them ideal for tasks such as image recognition, natural language
processing, and predictive modeling.
4.2 Splitting the Dataset into Training and Testing Sets
Building a model that generalizes well to new data is a critical concept in machine
learning. To accomplish this, the data should be divided into two parts: one to train the
model and another to test the performance of the model.
The split was performed with the train_test_split function, which is part of scikit-learn,
an important Python machine learning library. The function randomly splits the data set
into two groups depending on a given ratio, in this case 70% for training and 30% for testing.
It is essential to note that the split ratio can differ based on the size of the
dataset and the complexity of the model being developed. A common practice is using
a 70/30 or 80/20 split. Keeping a separate testing set also helps prevent overfitting.
Overfitting occurs when a model is overly complex and fits its training data too closely,
leading to poor generalization on new data. By evaluating on held-out data, we can check
that the model does not overfit to its training data and can generalize well to unknown new
inputs.
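A minimal sketch of this split, continuing from the cleaned DataFrame df sketched earlier, is shown below; the fixed random_state and the stratify option are assumptions added for reproducibility and class balance, not details taken from the project code.

```python
# Sketch: 70% training / 30% testing split of the cleaned dataset.
from sklearn.model_selection import train_test_split

X = df.drop(columns=["target"]).values   # feature matrix from the cleaned DataFrame
y = df["target"].values                  # 0 = no heart disease, 1 = heart disease

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.30,      # 30% of observations held out for testing
    random_state=42,     # assumed seed; not specified in the report
    stratify=y,          # assumption: keep class balance similar in both splits
)
```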
4.3 Building and Training the Neural Network Model
We utilized the Python Keras framework to build and train a neural network model.
Defining the model architecture, which consists of an input layer, two hidden layers, and an
output layer, was the first stage. The input layer accepted the input data, and the output
layer classified heart disease as present or absent.
The input data was processed and analyzed by the hidden layers using a
network of linked neurons. With 16 neurons in the first hidden layer and 8 neurons
in the second, this model makes use of two hidden layers. Within these hidden
layers, a rectified linear unit (ReLU) activation function was used to bring
nonlinearity into the network, enhancing its ability to learn intricate correlations in the data.
The Adam [3] optimizer and binary cross-entropy loss function were used to
train the neural network model. A well-liked variant of stochastic gradient descent (SGD),
Adam adapts the learning rate of each parameter during training. The difference between
predicted values and actual values is calculated using the binary cross-entropy loss function,
which is frequently used for binary classification problems.
The training process was carried out for a set number of epochs and a fixed batch
size. The number of epochs was set to 50, meaning the entire dataset passed through the
network 50 times, and the batch size was set to 10, chosen so that weight updates could
take place after processing 10 samples at once. These values were optimized through
experimentation in order to maximize network performance.
After training, the accuracy of the model was evaluated using Keras'
evaluate () function. A testing set was utilized to gauge its performance and assess
its capacity to generalize to new data sets. The accuracy score gives the percentage of test
instances classified correctly.
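The architecture and training settings described in this section can be expressed in Keras roughly as follows, continuing from the split sketched above. The sigmoid output activation and the use of the test set for validation monitoring are standard choices assumed here rather than details quoted from the project code.

```python
# Sketch of the model described in this section (some details assumed).
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(X_train.shape[1],)),   # input layer: one node per feature
    layers.Dense(16, activation="relu"),       # first hidden layer, 16 neurons
    layers.Dense(8, activation="relu"),        # second hidden layer, 8 neurons
    layers.Dense(1, activation="sigmoid"),     # binary output (assumed sigmoid)
])

model.compile(
    optimizer="adam",                # Adam optimizer
    loss="binary_crossentropy",      # binary cross-entropy loss
    metrics=["accuracy"],
)

history = model.fit(
    X_train, y_train,
    epochs=50,                          # the dataset passes through the network 50 times
    batch_size=10,                      # weights updated after every 10 samples
    validation_data=(X_test, y_test),   # assumption: test set used to monitor validation loss
    verbose=0,
)

loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"Test accuracy: {accuracy:.4f}")
```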
4.4 Hyperparameter Tuning
Hyperparameter tuning was performed to find the combination of settings that gives
the model the highest level of accuracy. The GridSearchCV function of scikit-learn was used
to tune the classification model. Hyperparameters control how the model learns, and it is
often difficult to determine the optimal values. Grid search works by providing a set of
possible values for each hyperparameter and training the model using all possible
combinations of those values. This exhaustive search across the hyperparameter space
reliably finds the best combination, but it can be computationally expensive. Grid search was
useful because it helped me identify the best set of hyperparameters for my model while
avoiding manual trial and error. To connect Keras to scikit-learn, a wrapper function accepts
as its argument a function that returns a Keras model that has been compiled. A binary
classification model with two hidden layers and a final output layer is built by this function.
The hyperparameter grid and the number of folds to utilize for cross-validation were
established using the GridSearchCV function in my code.
The best hyperparameters are shown together with the associated mean test score
after the search is finished. The mean test score represents the model's accuracy averaged
over all cross-validation folds for that hyperparameter combination.
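A rough sketch of this tuning step is given below using the scikeras KerasClassifier wrapper (older TensorFlow versions ship an equivalent wrapper in keras.wrappers.scikit_learn). The particular grid of candidate values and the three-fold cross-validation are assumptions for illustration; the report does not list the exact search space or fold count.

```python
# Sketch: grid search over training hyperparameters (grid values are assumptions).
from scikeras.wrappers import KerasClassifier
from sklearn.model_selection import GridSearchCV
from tensorflow import keras
from tensorflow.keras import layers

def build_model():
    """Return a compiled binary classifier with two hidden layers."""
    net = keras.Sequential([
        layers.Input(shape=(X_train.shape[1],)),
        layers.Dense(16, activation="relu"),
        layers.Dense(8, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    net.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return net

clf = KerasClassifier(model=build_model, verbose=0)

param_grid = {
    "batch_size": [5, 10, 20],   # assumed candidate values
    "epochs": [50, 100],         # assumed candidate values
}

grid = GridSearchCV(estimator=clf, param_grid=param_grid, cv=3, scoring="accuracy")
result = grid.fit(X_train, y_train)

print("Best mean test score:", result.best_score_)
print("Best hyperparameters:", result.best_params_)
```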
In summary, this chapter explored the development of a neural network model for
heart disease classification. We reviewed neural networks and their architecture, split the
dataset into training and testing sets, then built and trained the model using Keras and tuned
its hyperparameters with grid search. The resulting neural network model is evaluated in the
next chapter.
CHAPTER FIVE
MODEL EVALUATION
In this chapter, we will evaluate the performance of the neural network model developed
in the previous chapter. The following hardware and software were used for the experiments.
Hardware Requirement
• Apple M1 Chip, AMD Radeon RX480, and NVIDIA GeForce GTX 970
Google Colab
Cloud-based platform Google Colab offers essential GPU resources for machine
learning research. Google Colab was used for this project's calculations.
Python
Python was the primary programming language for this project; libraries like NumPy,
PyTorch, TensorFlow, cv2, Keras, plotly, and scikit-learn were used for data processing,
modeling, and visualization.
5.2 Evaluating the Performance of The Model on The Test Set
Evaluating the performance of the model on the test set is an essential step in
assessing its accuracy. We assessed this model using various metrics such as accuracy,
precision, recall, F1-score, and area under the ROC curve.
The accuracy metric measures the percentage of correctly classified instances in the
testing set. The neural network model achieved an accuracy rate of 93.44% on the test set.
The precision metric measures the percentage of true positives out of all
predicted positives. In other words, it measures how often a model is correct when
it correctly predicts someone has heart disease. The neural network model achieved a
precision of 89%, meaning that out of all individuals predicted to have heart disease, 89%
actually had the condition.
The recall metric counts the percentage of real positives among all positive
results. In other words, it assesses the frequency with which a model properly
identifies people with heart disease. The neural network model achieved a recall of 100%,
which means that it correctly recognized all of the people in the test set who actually had
heart disease.
The F1-score, the harmonic mean of precision and recall, assesses a model's overall
balance between the two. The model's ability to distinguish between positive and negative
cases is measured by the area under the ROC curve (AUC) metric. The neural network
model's AUC value of 0.947 demonstrates that it can reliably distinguish patients with heart
disease from those without. After hyperparameter tuning, the accuracy rose to 94.98%. This
is an improvement over the initial accuracy of 93.44%, and it illustrates the value of
hyperparameter tuning in deep learning.
Overall, the model achieved high values for all evaluation metrics. These results
indicate that this model is reliable and can be utilized accurately when making individual
predictions of heart disease risks.
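For reference, the metrics quoted above can be computed from the test-set predictions as sketched below; the 0.5 decision threshold is the usual default and is assumed here rather than taken from the project code.

```python
# Sketch: computing accuracy, precision, recall, F1 and ROC AUC on the test set.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_prob = model.predict(X_test).ravel()   # predicted probabilities from the network
y_pred = (y_prob >= 0.5).astype(int)     # assumed 0.5 threshold for the positive class

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
print("ROC AUC  :", roc_auc_score(y_test, y_prob))  # AUC uses probabilities, not labels
```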
5.3 Comparing the Performance of Binary Model with Hyperparameter Tuning
As shown in Figure 4, Figure 5 and Figure 6, our results revealed that the
neural network model had the highest accuracy in heart disease classification. It
achieved an accuracy rate of 93.44% on our test set before tuning, while the categorical
model achieved a lower accuracy.
Figure 6 Accuracy after hyperparameter tuning.
Our models were visualized using ROC curves and confusion matrices. The
ROC curves demonstrated the trade-off between true positive rate and false
positive rate for various thresholds, while the confusion matrices displayed the counts of
correct and incorrect predictions for each class.
The neural network model's ROC curve had an area under the curve (AUC) of 0.947.
As can be seen from the graphs in Figure 7 and Figure 8, AUC (Area Under
Curve) is not constant but becomes increasingly nonlinear with increasing epochs.
With hyperparameter tuning, the accuracy reached 94.98%, which is greater than the binary
model's accuracy of 93.44% without tuning.
Figure 8 Training and Testing Loss for Binary Model
Similarly, as the number of epochs increases, the training loss and test loss
decrease. Furthermore, after 50 epochs, the validation loss and training loss converge to
their final values. This shows that the model has been trained correctly.
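Curves like those in Figure 8 can be reproduced from the History object returned by model.fit. The report lists plotly among the libraries used; matplotlib is used here only as a compact, assumed alternative for the sketch.

```python
# Sketch: plotting training vs. validation loss across the 50 epochs.
import matplotlib.pyplot as plt

plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("Epoch")
plt.ylabel("Binary cross-entropy loss")
plt.legend()
plt.title("Training and testing loss for the binary model")
plt.show()
```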
Overall, our model evaluation showed that the neural network model outperformed
the alternative algorithms. Visualization of the results using the ROC curve and confusion
matrix made the model's behavior easier to interpret.
CHAPTER SIX
CONCLUSION AND FUTURE WORK
In this project, I preprocessed the Heart Disease UCI dataset to remove missing
values and perform feature scaling. I then experimented with other models, including a
categorical model, to compare their performance with the neural network model. The results
show that the neural network model is the most suitable for heart disease classification,
achieving the highest accuracy and AUC on the test set.
The modest size of the data collection is a constraint of this study. This may limit
how well the results generalize to other patient populations. Also, experimenting with
different neural network architectures could further improve the model's performance.
Future work includes expanding the data set by bringing together data from a variety of
sources and validating the model on larger and more diverse patient populations to improve
the prediction of heart disease.
REFERENCES
1. Krittanawong, C., Zhang, H., Wang, Z., Aydar, M., & Kitai, T. (2017).
2. Zhan, X., Wang, Y., Zhang, J., Wang, Y., Liu, J., & Li, M. (2021).
2020.
6. Wong, C. X., Sun, M. T., Odutayo, A., Emdin, C. A., & Knuuti, J. (2021).
7. Zhang, Y., Wei, Z., Li, Y., Liu, Y., & Wang, J. (2020). Deep learning-based
8. Zhang, Y., Cui, X., Wu, Z., & Chen, S. (2019). A machine learning-based
10. Kachuee, M., Fazeli, S., & Sarrafzadeh, M. (2018). A survey of machine-
11. Zeng, X., Huang, Y., Zeng, D., & Zhuang, L. (2020). A hybrid model
12. Thanh Tung, T., & Giang Nguyen, T. (2020). A novel hybrid algorithm
1678-1691.
13. Subudhi S, Mishra AK. Heart disease prediction using machine learning: A
doi:10.1109/CCCI47647.2020.9074174.
14. Hou Y, Zhang Q, Wang Y, et al. Heart disease prediction based on
International. 2021;2021:1-13.
of biomedical informatics.
16. Giri S, Karnatak H, Swain PK, Barik RK. A survey on heart disease
29. IEEE.
17. Liu J, Liang J, Li J, Chen Y, Chen J, Zhang Y. Feature selection for heart
Applications.
10.1016/j.tele.2018.11.007.
machine learning algorithms," Mobile Inf. Syst., vol. 2018, pp. 1–21, Dec.
20. S. M. Saqlain, M. Sher, F. A. Shah, I. Khan, M. U. Ashraf, M. Awais, and
doi:10.1007/s10115-018-1185-y.
21. Latha, C. B., & Jeeva, S. C. (2019). Improving the accuracy of prediction
of heart disease risk based on ensemble classification techniques. Informatics in
Medicine Unlocked, 16, 100203. https://doi.org/10.1016/j.imu.2019.100203
10.1109/ICICT48043.2020.9112443.