Predictive Analysis of Heart Diseases with Machine Learning Approaches
1Sri Vidya Mandir, College of Arts and Sciences, Salem, Tamilnadu, India
2,4,5Chitkara University Institute of Engineering and Technology, Chitkara University, Punjab, India
3,6College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
DOI: https://doi.org/10.22452/mjcs.sp2022no1.10
ABSTRACT
Machine Learning (ML) is used in healthcare sectors worldwide. ML methods help in the prediction of heart
disease and locomotor disorders from medical data sets. The discovery of such essential information helps researchers gain
valuable insight into how to tailor diagnosis and treatment for a particular patient. Researchers use various
Machine Learning methods to examine massive amounts of complex healthcare data, which aids healthcare
professionals in predicting diseases. In this research, we use an online UCI dataset with 303 rows and 76
attributes. Fourteen of these 76 attributes are selected for testing, which is necessary to validate the
performance of the different methods. The Isolation Forest approach uses the data set's most essential attributes and
metrics to standardize the information for better precision. This analysis is based on supervised learning methods,
i.e., Naive Bayes, SVM, Logistic Regression, Decision Tree Classifier, Random Forest, and K-Nearest Neighbor.
The experimental results, measured in terms of sensitivity, precision, accuracy, and F1-score, demonstrate the
strength of KNN with eight neighbours compared to the other methods, i.e., Naive Bayes, SVM (linear
kernel), Decision Tree Classifier with 4 or 18 features, and Random Forest classifiers.
Keywords: Heart Disease, Health Care, Machine Learning, Naive Bayes, Decision Tree Classifier, SVM, K-Nearest Neighbor, Logistic Regression, Random Forest.
1.0 INTRODUCTION
In data science research, Machine Learning is a strategy for gaining knowledge from prior research
experience. Conventional methods encounter several issues when trying to solve all the problems pointed out by various
researchers. The purpose of data assessment is to evaluate genuine models, which necessitates a
dependable, robust, and trustworthy framework, such as Machine Learning methods. A Machine
Learning method works with immediate input from the training sample after learning the foundational
patterns in the data [1]. The outcome of the whole training phase is an automatic framework. The
proposed framework suits both static and dynamic datasets. A prediction is the outcome of the training
and testing stages. The testing stage utilizes several information collections, i.e., a dataset for training, a dataset
for testing, and a specific type of ML model, i.e., a classification or clustering model [2].
Over the past decade, heart (cardiovascular) disease has been the leading cause of death globally.
According to the World Health Organization, about 23.6 million people die primarily from
cardiovascular disorders, with coronary heart disease and cerebral stroke accounting for 82 percent of these
deaths. ML methods are more precise and efficient than other methods and require no human assistance [3].
An ML model mainly takes input data, i.e., any text or images. The dataset is then split into two sections: a
training dataset and a testing dataset. The training model is created using the training dataset, and the
testing dataset is then applied to the trained model. This trained model produces the results, i.e., outcomes A,
B, and C. There have been several applications of Machine Learning methods in everyday life.
The high dimensionality of records is a widely known issue in Machine Learning; the data sets researchers utilize
contain enormous amounts of information that occasionally cannot be visualized even in 3D, a problem known as the
curse of dimensionality [4]. A diversity of Machine Learning methods is used in the classification and forecasting
of various cardiovascular diseases. An automatic classification model that distinguishes between
heart disease patients at extremely high risk and those at a low level of risk can also be used
to identify them [5].
High blood pressure, diabetes, high cholesterol, obesity, tobacco use, and a family history of heart disease are
potential causes. According to the research [4], the significant factors which cannot be modified are gender,
hereditary factors, and age. Another factor in this data source is thalassemia, which is determined by
hereditary factors. Other factors include high blood pressure, regular tobacco use, high blood cholesterol,
lack of physical activity, overweight, physical sickness, stress level, alcohol consumption, and an irregular
diet [6].
The proposed study of article [7] makes significant contributions by developing machine-learning-based
health care methods for predicting heart disease. Various ML prediction methods, i.e., regression
models, Random Forest, SVM, Decision Tree Classifier, KNN, and Naive Bayes, were utilized in the
research [8] to classify the patients as having no heart disease (healthy) or having heart disease (unhealthy). All the
relevant and interrelated features that significantly affect the predicted outcome were determined using
minimal redundancy maximal relevance (mRMR), the least absolute shrinkage and selection operator (LASSO), and
Relief. Cross-validation techniques, i.e., the k-fold validation method, were used. Different performance metrics,
i.e., precision, F1-score, AUC, and recall, were determined to measure and analyse the efficiency of the various
classification algorithms [9].
The following are the critical contributions of this research proposal:
• This study thoroughly investigated the accuracy, precision, and processing performance of many ML
classification methods.
• The classifiers' effectiveness is evaluated using k-fold cross-validation with efficient feature
selection methods (FSM), i.e., filter and wrapper.
• The research also reveals which feature extraction method can be used to develop a robust learning
algorithm for diagnosing heart disease using classification models.
• The proposed forecasting system also removes the various anomalies and missing values from the
heart disease dataset.
• The proposed system also identifies efficient and precise data pre-processing models with better accuracy.
The research article contains various sections. These sections mainly cover literature review, the challenges in
current work, existing Machine Learning methods and critical features, description of heart disease data sets,
proposed model, implementation process, simulation results, result comparisons, and, finally, the conclusions.
2.0 LITERATURE REVIEW

In the current era, the most severe disease is heart disease. An early diagnosis of heart disease is essential for
healthcare practitioners in preventing their patients from contracting the disease and saving lives. Various
classification methods were evaluated to anticipate all the heart disease cases in the dataset [10]. Hypertension,
kidney disease, lipid fluctuations, exhaustion, and many other features make early detection difficult. Numerous data analytics techniques
have been proposed over the years to help medical professionals recognize the early symptoms of heart disease [11].
Table 1 shows a summary of the reviewed literature and the efficiency of the ML algorithms, which are
examined to determine the research gap. It also aids in identifying essential aspects of the cardiac disease database and the
prediction of improved performance, precision, and accuracy.
A variety of ML methods have been used to classify and predict heart disease from the source data, including the
Stochastic Gradient Descent method, KNN, the Naive Bayes method, the Support Vector Machine method, the
AdaBoost method, the Decision Tree Classifier method, the J48 method, the JRIP method, and some others [12].
In the research article [13], the researcher suggested an efficient method to find out the presence of heart
disease utilizing the Back-Propagation (BP) feature extractor of an ANN on available online heart disease
database categorizations. In research [14], ML methods with an ANN were used to predict heart disease
cases. In this work, researchers created an automated application to anticipate susceptibility to heart
disease, trained using some primary symptoms, i.e., disease period, gender, heartbeat rate, and history. The
ANN model was shown to be the most accurate and precise algorithm for forecasting heart disease
compared with various other ML models.
In the research [15], a hybrid algorithm was recommended to accurately predict and classify the risk of heart
disease in various age groups. This research introduced an automated model for precise data analysis using a
neural network. The proposed model accurately predicts heart conditions in the initial stages. It has been
demonstrated that assessing an individual's risk level by utilizing ML techniques, i.e., DT, KNN, NB, and
GA, can be more effective when using characteristics and variations of the methods mentioned above.

The researcher [16] utilized a dimension reduction procedure to pre-process the data source and eliminate
missing values and outliers. The dataset has fourteen disorder characteristics, but the researchers used four
measures for performance monitoring: the confusion matrix, sensitivity, specificity, and accuracy. The researchers obtained
the highest classification precision of 93%, better than earlier described classification methods.
3.0 METHODOLOGY
Supervised and unsupervised learning are subcategories of Machine Learning and AI. These methods help in
the training process, mainly utilizing two classes of dataset, i.e., labelled and unlabelled. This research
presents an automated ML model to predict and classify heart disease [29]. Figure 1 shows the architecture of
the proposed ML-based model; it includes phases of data pre-processing, assessment, feature selection, model
training, testing, and comparison of outcomes.
3.1 Decision Tree Classifier

A Decision Tree Classifier represents a process incorporating a tree-like model of decisions and their possible
outcomes, including occurrence implications, cost elements, and performance characteristics. It is also a
strategy for demonstrating a preliminary optimization approach. The Decision Tree Classifier is among the most
frequently used types of classification model. It is a flowchart-like structure in which each node represents a test
on a feature. A Decision Tree Classifier, as stated earlier, splits the classifications into sub-sets (i.e., root, left
child, right child). It is the most widely used approach on the chosen sample group [30].
Iterative Dichotomiser-3 (ID-3), designed by the researcher [31], is among the most prevalent Decision Tree
Classifier methods, as it produces all possible candidate Decision Tree Classifiers and chooses the best. Its
learning time is shorter in comparison with other learning algorithms. The number of items and
characteristics in the training data set determines the precise complexity of a Decision Tree Classifier. It is not
based on any assumptions about probability distributions, and it can manage high-dimensional data with great
accuracy. Information Gain and Gain Ratio are the essential attribute selection measures (ASM) [32].
3.1.1 Information Gain Value (IGV)
It mainly represents the reduction in entropy value. To analyze a given feature, the information gain
method measures the variation between the entropy values before and after partitioning the dataset. Information
gain was first used by the ID-3 ("Iterative Dichotomiser-3") decision tree technique [33]. Equation (1) describes
the formula for the information of a data partition D:

Information(D) = − Σ_{i=1}^{m} p_i · log₂(p_i)    (1)
where p_i represents the probability that an arbitrary tuple in D belongs to class C_i, and Information(D)
represents the information (entropy) of the tuple set D.

Information_A(D) = Σ_{j=1}^{V} (|D_j| / |D|) · Information(D_j)    (2)

Gain(A) = Information(D) − Information_A(D)    (3)

Equation (2) shows the expected information after partitioning on attribute A, and Equation (3) shows the information gain value.
Where:
• Information_A(D) represents the mean quantity of information required to determine the class of a tuple
in D after partitioning on A.
• The weight of the j-th partition is represented by |D_j| / |D|.
• Information_A(D) is also the expected information necessary to classify a tuple from D based on the
partitioning induced by attribute A.
• The attribute A with the maximum information gain, Gain(A), is selected as the splitting feature at node N.
Information gain is biased towards attributes with many outcomes; that is, it favours attributes with a large
number of distinct values. Consider a unique identifier such as a customer ID: splitting on it produces pure
partitions with Information_A(D) = 0, which maximizes the information gain while introducing useless
partitioning. The gain ratio is an extended version of information gain used primarily by the Decision Tree
Classifier algorithm C4.5 [17], which improves on the existing ID3; it normalizes the information gain by the
split information. The J-48 method is a Java implementation of the Decision Tree Classifier method C4.5,
available in the WEKA data mining platform [34].
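The attribute-selection measures above can be sketched in a few lines of Python. This is an illustrative implementation only; the file name heart.csv, the column names cp and target, and the use of pandas/NumPy are assumptions rather than details taken from the paper.

```python
import numpy as np
import pandas as pd

def entropy(labels):
    # Information(D) = -sum(p_i * log2(p_i)) over the classes present in D
    probs = labels.value_counts(normalize=True)
    return float(-(probs * np.log2(probs)).sum())

def information_gain(df, feature, target="target"):
    # Gain(A) = Information(D) - Information_A(D)
    info_d = entropy(df[target])
    weights = df[feature].value_counts(normalize=True)
    info_a = sum(w * entropy(df.loc[df[feature] == v, target])
                 for v, w in weights.items())
    return info_d - info_a

def gain_ratio(df, feature, target="target"):
    # C4.5 normalizes the information gain by the split information of the feature
    split_info = entropy(df[feature])
    return information_gain(df, feature, target) / split_info if split_info else 0.0

# Hypothetical usage on a local copy of the 14-attribute heart disease table
df = pd.read_csv("heart.csv")          # assumed file name
print(gain_ratio(df, "cp"))            # gain ratio of the chest-pain attribute
```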
Naive Bayes is a classifier based on supervised learning that solves regression and classification tasks. It is
based on Bayesian statistics and supports binary (two-class) as well as multi-class classification. The technique
is most intuitive for binary classification and also handles varied input data [35]. The Naive Bayes framework is
straightforward and well-suited to large data sets, and compared with other Machine Learning methods it often
produces higher precision.
Bayes' theorem calculates the probability of an event occurring based on the probability of a previous
event. Equation (4) expresses the mathematical model:

Prob(A_i | B_i) = [Prob(B_i | A_i) · Prob(A_i)] / Prob(B_i)    (4)

where A_i and B_i represent events; once event B_i is observed, we need to find the probability of event A_i.
Event B_i is also referred to as the evidence.
• Prob(A_i) is the prior probability of A_i, i.e., the probability of the event before the evidence is observed.
• Prob(B_i | A_i) is the likelihood, i.e., the probability of observing the evidence given that event A_i has
occurred.
• The evidence is the attribute value of an unidentified instance (here, event B_i).
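As a minimal sketch of the Naive Bayes classifier described above, the snippet below uses scikit-learn's GaussianNB; the file name and the 60/40 split follow the experimental setup mentioned later in the paper, but the column names remain assumptions.

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

df = pd.read_csv("heart.csv")                            # assumed local copy of the UCI data
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

nb = GaussianNB()                                        # Bayes' theorem with a Gaussian likelihood per feature
nb.fit(X_train, y_train)
print(nb.predict_proba(X_test[:5]))                      # posterior probabilities P(class | evidence)
print(nb.score(X_test, y_test))                          # accuracy on the held-out 40% split
```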
Both Scikit-learn and Spark document the impurity equations used by their tree-based models. Both use Gini
impurity by default for classification, with an alternative criterion available as a substitute. For regression, both
use the mean squared error to determine the variance reduction; in Scikit-learn, the reduction can also be
calculated using the mean absolute error [37].

Equation (5.1) represents the Gini purity and impurity formulas:

Gini = Σ_{i=1}^{n} p_i²,    Gini Impurity = 1 − Σ_{i=1}^{n} p_i²    (5.1)

where p_1, ..., p_n represent the probabilities of each possible class in the solution space, Gini represents the
purity, and Gini Impurity represents the impurity of a particular node. Gini works only for categorical targets.
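A short sketch of the Gini calculation in Equation (5.1), together with the corresponding scikit-learn criterion settings; the class proportions used here are purely illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

def gini_impurity(class_probs):
    # Gini = sum(p_i^2) measures purity; impurity = 1 - Gini
    p = np.asarray(class_probs, dtype=float)
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([0.5, 0.5]))    # 0.5 -> maximally impure two-class node
print(gini_impurity([1.0, 0.0]))    # 0.0 -> perfectly pure node

# Criterion choices mirrored from the text: Gini by default for classification,
# squared error (variance reduction) or absolute error for regression.
clf = DecisionTreeClassifier(criterion="gini")
reg = DecisionTreeRegressor(criterion="squared_error")   # "absolute_error" is also supported
```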
K-Nearest Neighbor (KNN) is a supervised Machine Learning methodology for solving regression as well as
classification problems. It is simple to set up and understand, but it has the disadvantage of becoming noticeably
slower as the volume of data in use expands. As a result of its high level of accuracy, the KNN method can
directly compete with precise existing models. If high precision is needed but a human-readable model is not,
the KNN algorithm is a good fit. Predictions are evaluated mainly based on distance metrics [38].

The best algorithm for a given data set depends on the number of samples, features, and dimensions. Datasets
used in Machine Learning require intelligent analysis. For a query point, the desired number of neighbours k
must be specified [39].
• Most of the time, the value of k has little effect on the brute-force search time.
• The time it takes to search a ball tree, or even a KD tree, slows down as k grows for two reasons: a larger k
requires examining a substantial chunk of the feature space, and for k > 1 the tree is traversed while
maintaining an internal queue of results.

When k approaches N, the capacity to prune branches in a tree-based query is limited, and brute-force queries
may be more efficient in this circumstance. A construction step is required for both the ball tree and the KD
tree. When amortized across a large number of queries, the cost of this construction becomes trivial. On the
other hand, construction can account for a large portion of the total cost if only a few queries are run; a brute-
force method is superior to a tree-based method when only a few query points are required [40].
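The sketch below illustrates the KNN setup discussed above, including the brute-force versus tree-based search trade-off; the eight-neighbour setting follows the abstract, while the file name and the standardization step are assumptions.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

df = pd.read_csv("heart.csv")                            # assumed local copy of the UCI data
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Distances are scale-sensitive, so the features are standardized first.
# algorithm can be "brute", "kd_tree" or "ball_tree"; "auto" lets scikit-learn
# choose based on the number of samples, features and neighbours.
knn = make_pipeline(StandardScaler(),
                    KNeighborsClassifier(n_neighbors=8, algorithm="auto"))
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
```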
The Support Vector Machine (SVM) is a supervised classification-based ML method that can be utilised to find
solutions for different regression and classification tasks. The method constructs a decision boundary in an
n-dimensional space (where n represents the number of features) [41]. It is used to solve complex problems that
cannot be separated linearly; a "kernel trick" function can be used to find non-linear solutions to such problems
efficiently. In SVM, the data are mapped into a high-dimensional space where the classes can be separated. The
classifier chooses a dividing hyperplane with the largest margin, and the points closest to that hyperplane are the
support vectors. A fully trained classification algorithm can categorize any test sequence and generalize beyond
the training examples (generalization)
[42].
A discriminant function can be used to determine the classification model. Equation (6) represents the
formula for the classification function:

f(X_i) = W · X_i + b    (6)

Here X_i denotes a training or test pattern, W represents the weight vector, and b represents the bias term of the
function. The dot product over the input space, i.e., the sum of the products of the vector elements [W_i · X_i],
is denoted by the symbol W·X_i [43].
The decision function in the case of two features is as follows, Equation (7):

f(X) = W_1·X_1 + W_2·X_2 + b    (7)

After training, the SVM provides us with estimates for W_1, W_2, and b.
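A minimal sketch of a linear-kernel SVM, the variant named in the abstract, showing how the learned weight vector W and bias b of Equation (6) can be inspected; the file name and feature scaling are assumptions.

```python
import pandas as pd
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

df = pd.read_csv("heart.csv")                            # assumed local copy of the UCI data
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Linear kernel as in the abstract; kernel="rbf" would apply the "kernel trick"
# for problems that are not linearly separable.
svm = make_pipeline(StandardScaler(), SVC(kernel="linear"))
svm.fit(X_train, y_train)

linear_svc = svm.named_steps["svc"]
print(linear_svc.coef_, linear_svc.intercept_)           # the learned W and b of f(x) = W.x + b
print(svm.score(X_test, y_test))
```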
Logistic Regression is yet another method that Machine Learning has adopted from the field of statistics. It is
mainly preferred for two-class classification problems. It is a statistical technique that predicts outcomes for
new data based on previous findings from the given data set. A logistic regression method anticipates a data
parameter by studying the relationship between one or more pre-existing predictor factors [44].
The logistic regression equation can be defined as described in Equation (8):

y = e^(β0 + β1·x) / (1 + e^(β0 + β1·x))    (8)

where y represents the predicted value, β0 represents the bias (intercept) term, and β1 represents the coefficient
of the single input value x. The β coefficients (stable real-valued statistics) must be learned from the training
examples for each column in the input data.
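The logistic function of Equation (8) and a fitted LogisticRegression model can be sketched as follows; as before, the file name is an assumption and the coefficients printed are the learned β values, not figures from the paper.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

def sigmoid(z):
    # y = e^z / (1 + e^z), the logistic function of Equation (8) with z = b0 + b1*x
    return np.exp(z) / (1.0 + np.exp(z))

df = pd.read_csv("heart.csv")                            # assumed local copy of the UCI data
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg.fit(X_train, y_train)
print(logreg.named_steps["logisticregression"].coef_)    # the learned beta coefficients
print(logreg.score(X_test, y_test))
```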
This section provides an analysis of the different attributes associated with various heart diseases. The
experimental analysis is based on machine learning supervised classification techniques, i.e., SVM, KNN,
Decision Tree Classifier, RF, and Naïve Bayes.

In this research, we utilise the online UCI heart disease dataset [21]. The various ML
techniques were implemented using Python programming under the Anaconda distribution.
The data set for this study is retrieved from the UCI Machine Learning Repository (an online data set) [21,
25]. The database consists of a total of four data sources collected from four healthcare institutions. Compared
with the other data sets, the Cleveland data source has very few missing features and more records.

This dataset contains the records of 303 patients. Although there are 76 characteristics in this data set, all
published studies use only a subcategory of fourteen of them. The target attribute ranges from zero (no presence
of disease) to four (presence of disease). The experimental study using the Cleveland data set focuses on
differentiating between presence (values 1, 2, 3, and 4) and absence (value 0). Table 2 shows the various
parameters related to the heart disease data set.
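One way to load the Cleveland file and reduce it to the fourteen commonly used attributes with a binary target, as described above, is sketched below. The file name processed.cleveland.data, the column names, and the choice to call the label column target follow the usual UCI distribution and are assumptions, not details stated in the paper.

```python
import pandas as pd

# Fourteen attributes conventionally used out of the 76 available in the UCI data.
columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach",
           "exang", "oldpeak", "slope", "ca", "thal", "target"]

df = pd.read_csv("processed.cleveland.data", names=columns, na_values="?")

# The raw target ranges from 0 (absence) to 4 (presence); the experiments
# distinguish only presence (1-4) from absence (0).
df["target"] = (df["target"] > 0).astype(int)
print(df.shape)                         # expected: (303, 14)
print(df["target"].value_counts())      # class distribution of the binary label
```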
The dataset attributes have no fixed bounds and are not evenly distributed. The dataset includes a
significant number of incomplete as well as noisy values. These records are pre-processed to tackle missing-value
difficulties and generate accurate forecasts. The data pre-processing phase consists of several stages,
including data cleaning, transformation (normalization and aggregation), data integration, and reduction [39].
The proposed system utilizes two different approaches for data pre-processing, i.e., normalization and
aggregation. The results obtained after using a normally distributed dataset to overcome the
overfitting issues, and then utilizing the Isolation Forest for feature extraction and anomaly removal, are quite
enticing. The skewness of the statistics, outlier recognition, and data distributions are checked using numerous
plotting methods [32]. Figure 2 shows the correlation between heart disease and the numeric features. This
matrix summarizes the overall dataset as input towards a more advanced analysis and identifies directions for
further investigation. Each cell represents the correlation between two variables.
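A compact sketch of the pre-processing steps listed above: imputing missing values, standardizing the predictors, and computing the feature correlations summarized in Figure 2. The file name and the median-imputation strategy are assumptions carried over from the loading sketch.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("heart.csv")                        # assumed pre-loaded 14-attribute table

# Data cleaning: fill the few missing values (e.g. ca, thal) with the column median.
imputer = SimpleImputer(strategy="median")
df[df.columns] = imputer.fit_transform(df)

# Normalization / standardization of the predictors (zero mean, unit variance).
features = df.drop(columns=["target"])
scaled = pd.DataFrame(StandardScaler().fit_transform(features), columns=features.columns)

# Correlation of every numeric feature with the target, as summarized in Figure 2.
print(df.corr()["target"].sort_values(ascending=False))
```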
We need to analyse the data pattern in order to categorise the heart disease dataset and ensure precise predictions about
heart disease levels. Positive heart disease cases make up 54% of this dataset, whereas no-heart-disease cases
make up 45%. The dataset should ideally be balanced to prevent overfitting problems. Figure 3 provides a
representation of the heart disease dataset's class statistics, where 1 represents records with heart disease and 0
represents records with no heart disease. The dataset contains 165 candidates with heart disease and 138 without
any heart disease, showing the distribution of positive and negative heart disease cases. This helps an ML
method identify which patterns in the dataset can best support the process [40]. Next, the Isolation Forest
method is applied to remove the anomalies.
Isolation Forest (Anomalies Removal): We use the Isolation Forest method to remove anomalies from the
dataset. Randomly sub-sampled data is processed in an Isolation Forest as a hierarchical structure built by
randomly selecting split features. Anomalies tend to be isolated closer to the root of a tree, since they require
fewer splits to be separated from the rest of the samples, whereas normal points extend further down the tree.
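A minimal Isolation Forest sketch for the anomaly-removal step just described; the contamination rate is an assumed illustrative value rather than a figure from the paper.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_csv("heart.csv")                                # assumed pre-processed 14-attribute table
X = df.drop(columns=["target"])

# Fit randomized isolation trees; points isolated with few splits score as anomalies.
iso = IsolationForest(contamination=0.05, random_state=42)   # contamination is illustrative
labels = iso.fit_predict(X)                                  # -1 = anomaly, +1 = normal

df_clean = df[labels == 1]                                   # drop the flagged outliers
print(len(df), "->", len(df_clean), "records after anomaly removal")
```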
Numerous plots and methods were used to obtain distribution results, to verify various parameters, and to determine the
asymmetry (skewness) of the heart disease data set. Separate plots are displayed so that a comprehensive review of
the information can be further examined. The plotting includes various distributions: a) age and gender distributions,
b) trestbps and chest pain distributions, c) fasting blood sugar and cholesterol distributions, d) thalach and
electrocardiogram distributions, e) resting electrode distributions, f) oldpeak and exang distributions, g) ca and
slope distributions, and h) target and thal distributions [28].
Figure 4 represents the ratio based on sex (1: male and 0: female) and shows the heart disease count by sex
(1: has heart disease and 0: has no heart disease). The graph indicates that 114 females and 93 males have no heart
disease, whereas 23 females and 72 males have heart disease out of the 303 patient records. These findings show that
males are more likely than females to suffer from cardiac disease.
Fig. 4. Heart disease ratio based on sex (1: Have heart disease and 0: Have no heart disease)
Figure 5 represents heart disease by chest pain type (0, 1, 2, and 3). The graph indicates that for males, type-0
chest pain cases are 70, type-1: 31, type-2: 50, and type-3: 56. Similarly, for females, type-0 chest pain cases are
40, type-1: 25, type-2: 11, and type-3: 19. In both groups, type-0 chest pain patients are more numerous than the
other chest pain types.
Fig. 6. Heart disease based on fasting blood sugar > 120 (1: Heart disease and 0: No heart disease)
Similarly, Figure 6 represents heart disease based on fasting blood sugar > 120 (1: heart disease and 0: no heart
disease). The graph indicates that positive heart disease cases with an FBS value greater than 120 number 23
males and 22 females, and negative (no heart disease) cases number 142 males and 116 females. The results
indicate that, based on FBS level, male patients are slightly more affected than females.
During this procedure, we obtained important measurement information from the statistics and checked the
data distributions of the different characteristics, such as the correlations of the variables with the target
variable and various probabilities. Figure 7 shows an overview of the cardiovascular disease dataset. This
dataset contains 14 attributes, and each attribute has different summary values, including count, mean, std, min,
and max. The dataset contains 165 persons with heart disease and 138 persons without heart disease. Similarly,
Figure 8 shows the parameter distributions and their values in the dataset.

Figure 9 shows the distribution of the heart disease parameters restecg, exang, and slope, plotted by heart
disease state, i.e., with heart disease and without heart disease. Figures 10 and 11 show the distributions of the
heart disease parameters ca, thal, target, sex, cp, and FBS, also plotted by heart disease state.
To assess the effectiveness of the forecast modelling techniques, we compare them to each other. The following
performance measuring parameters have been used in this research [30].
• Accuracy: the fraction of occurrences in the complete dataset that have been successfully forecast out of all
instances, as described in Equation (9):

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (9)

• Sensitivity / Recall: the fraction of positive instances in the data that are correctly predicted [29].
Equation (10) describes the recall:

Sensitivity or Recall = TP / (TP + FN)    (10)

• Precision: it measures how many of the positive forecasts have been accurate, and therefore determines the
accuracy for the minority class. It is measured by dividing the number of correctly predicted positive
instances by the total number of instances predicted positive [30]. Equation (11) describes the formula for
precision:

Precision = TP / (TP + FP)    (11)

• F1-Score: precision and recall are combined in a composite harmonic mean (mean of reciprocals).
Researchers first assess the model's precision, i.e., its ability to recognize only relevant instances, and its
recall to do the same [22]:

F1-Score = 2 × [(precision × recall) / (precision + recall)]    (12)

• AUC (Area Under the Curve): an estimate of the probability that the model ranks a randomly chosen
positive instance higher than a randomly chosen negative instance. The area under the receiver operating
characteristic (ROC) curve is plotted to visually evaluate the models' effectiveness:

AUC = ∫ₐᵇ f(x) dx,  where y = f(x) is the ROC curve between x = a and x = b    (13)
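The metrics in Equations (9)-(13) map directly onto scikit-learn helpers, as in this sketch; the label and score arrays are illustrative values only, not results from the paper.

```python
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, roc_auc_score, confusion_matrix)

# Illustrative ground-truth labels, predicted labels and scores (not from the paper).
y_true  = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3, 0.95, 0.25]

print("Accuracy :", accuracy_score(y_true, y_pred))    # (TP + TN) / all instances
print("Recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("F1-score :", f1_score(y_true, y_pred))          # harmonic mean of precision and recall
print("AUC      :", roc_auc_score(y_true, y_score))    # area under the ROC curve
print(confusion_matrix(y_true, y_pred))                # [[TN, FP], [FN, TP]]
```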
In this section, the classification techniques and their consequences are discussed from different viewpoints.
The experimental analysis has been performed using the Python programming language in the Anaconda
distribution. The experimental results include the correlation measures. We use 60% of the dataset for
training and 40% for testing.

Correlation parameters are determined to measure the strength of the association among multiple numerically
measured dependent variables. The effectiveness of the classification models has been evaluated based on a set of
criteria. Furthermore, a k-fold cross-validation procedure has been applied, and performance measurements have
been used to monitor the efficiency of the classification methods. All the essential features are normalized and
standardized to categorize the actual data.
The exclusive features of the input data were examined by applying a 10-fold cross-validation approach to several
ML classification models. The dataset had two components: the training dataset and the testing dataset.
In each fold, about 90% of the data was utilised for training and 10% for testing. Table 3 shows the results of the 10-fold
cross-validation of the classification models with exclusive features.
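The 10-fold protocol described above can be sketched as follows; the estimator (KNN with eight neighbours) is one of the paper's classifiers, while the file name and the stratified, shuffled fold configuration are assumptions.

```python
import pandas as pd
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

df = pd.read_csv("heart.csv")                        # assumed pre-processed 14-attribute table
X, y = df.drop(columns=["target"]), df["target"]

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=8))
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)   # 90% train / 10% test per fold

scores = cross_validate(model, X, y, cv=cv,
                        scoring=["accuracy", "recall", "precision", "f1", "roc_auc"])
for metric in ["accuracy", "recall", "precision", "f1", "roc_auc"]:
    print(metric, scores[f"test_{metric}"].mean())   # mean score across the ten folds
```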
The experimental result for the Decision Tree Classifier method shows accuracy (0.7898), recall (0.7548),
precision (0.778), F1-score (0.864), and AUC (0.812). Similarly, the Naive Bayes method shows accuracy
(0.8678), recall (0.804), precision (0.854), F1-score (0.865), and AUC (0.845). The experimental result for the
Random Forest method shows accuracy (0.879), recall (0.832), precision (0.887), F1-score (0.879), and AUC (0.898).

Similarly, for the k-Nearest Neighbor method, the experimental results are accuracy (0.941), recall (0.948),
precision (0.917), F1-score (0.908), and AUC (0.799). For the SVM method, the experimental results are
accuracy (0.891), recall (0.878), precision (0.907), F1-score (0.885), and AUC (0.809), and for the Logistic
Regression method, the results are accuracy (0.871), recall (0.861), precision (0.881), F1-score (0.865), and
AUC (0.813).
Different performance evaluation metrics have been implemented in this research work to assess classifier
performance. Each observation in the test dataset is assigned to exactly one cell of the confusion
matrix. There are two response classes; hence the matrix is 2×2. Table 4 shows a sample
confusion matrix for actual and predicted data samples. Table 5 and Table 6 show the confusion matrix
results for various methods during the training and testing phases.
Table 5. Confusion matrix results (training phase)
Method Name TP FP FN TN
LR Method 80 17 11 104
SVM (Linear Kernel) Method 89 8 6 109
KNN Method 82 15 13 102
RF Method 97 0 0 115
DT Classifier Method 97 0 0 115
NB Method 97 0 0 115
Table 6. Confusion matrix results (testing phase)
Method Name TP FP FN TN
LR Method 34 7 5 45
SVM (Linear Kernel) Method 36 5 6 44
KNN Method 35 6 6 44
RF Method 33 8 8 42
DT Classifier Method 34 7 13 37
NB Method 33 8 8 42
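The TP/FP/FN/TN counts in Tables 5 and 6 can be reproduced for any fitted model with scikit-learn's confusion_matrix, as in this sketch; the model and split here are assumptions carried over from the earlier sketches, so the counts will not necessarily match the tables.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

df = pd.read_csv("heart.csv")                        # assumed pre-processed 14-attribute table
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_train, y_train)

# confusion_matrix returns [[TN, FP], [FN, TP]] for a binary target.
tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
print("TP:", tp, "FP:", fp, "FN:", fn, "TN:", tn)
```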
In clinical diagnosis, feature selection techniques support accurate health care decisions.
Table 7 shows the seven relevant features selected using the filter feature selection technique. Figure
13 shows the feature scores of the essential features; chest pain appears to be a significant aspect for predicting
heart disease in the ranking graph. The test was performed on a range of different numbers of records. The results
demonstrate the strength of the SVM and Random Forest methods with the selected features.
Table 7. Experimental results Based on Selected Features using the Filter method
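A filter-type feature selection step like the one behind Table 7 can be sketched with SelectKBest, keeping the seven highest-scoring attributes; the ANOVA F-score used here is an assumed scoring function, since the paper does not name the exact filter statistic.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

df = pd.read_csv("heart.csv")                        # assumed pre-processed 14-attribute table
X, y = df.drop(columns=["target"]), df["target"]

selector = SelectKBest(score_func=f_classif, k=7)    # filter method: rank features, keep the top 7
selector.fit(X, y)

scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(scores)                                        # feature scores, cf. Figure 13
print("Selected:", list(X.columns[selector.get_support()]))
```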
The final prediction rate (heart disease prediction) results for all the Machine Learning classifier techniques
are presented in Figure 14. The SVM method shows an 83.25% prediction score, the Decision Tree Classifier
83.89%, KNN 86.45%, Random Forest 88.35%, Logistic Regression 84.22%, and Naive Bayes 84.69%. The
experimental result demonstrates that the Random Forest classifier technique has a better prediction rate for
detecting heart disease than the other models.
This research article has worked with multiple Machine Learning methods and four performance measures for
conducting a comparative assessment and obtaining true-positive performance. We concluded that the Machine
Learning methods significantly outperformed statistical techniques in this research. This article supports the
studies of various researchers which suggest that ML models are the best choice to predict and classify heart
disease even if the database is small. Various performance parameters, i.e., precision, F1-score, accuracy, and
recall, have been compared for all the Machine Learning classification techniques on the UCI online heart
disease data set. The KNN classification algorithm outperformed the other classifiers on the fourteen available
parameters.

As a future scope of this work, and to address the limitations of the efforts made above, emerging
classification models can be included for complex time series data sets. In addition, the research can be
applied to other biological diseases.
Conflicts of Interest
Concerning the publication of this study, there are no conflicts of interest among the authors.
Availability of Data
Upon request, the statistics used to verify the findings of this research can be obtained from the corresponding
author.
REFERENCES
[1] J. P. Li, A. U. Haq, S. U. Din, J. Khan, A. Khan, and A. Saboor, “Heart disease identification method
using Machine Learning classification in e-healthcare,” IEEE Access, Vol. 8, 2020, pp. 107562–107582.
[2] M. Mullen, A. Zhang, G. K. Lui, A. W. Romfh, J. W. Rhee, and J. C. Wu, “Race and Genetics in
Congenital Heart Disease: Application of iPSCs, Omics, and Machine Learning Technologies,” Frontiers
in Cardiovascular Medicine, Vol. 8, 2021, pp. 37–51.
[3] S. Mohan, C. Thirumalai, and G. Srivastava, “Effective heart disease prediction using hybrid Machine
Learning techniques,” IEEE Access, Vol. 7, 2019, pp. 81542–81554.
[4] N. K. Trivedi, S. Simaiya, U. K. Lilhore, and S. K. Sharma, “An efficient credit card fraud detection
model based on Machine Learning methods,” International Journal of Advanced Science and
Technology, Vol. 29, No. 5, 2020, pp. 3414–3424.
[5] A. K. Dwivedi, “Performance evaluation of different Machine Learning methods for prediction of heart
disease,” Neural Computing and Applications, Vol. 29, No. 10, 2018, pp. 685–693.
[6] S. K. Jonnavithula, A. K. Jha, M. Kavitha, and S. Srinivasulu, “Role of Machine Learning algorithms
over heart diseases prediction,” AIP Conference Proceedings, Vol. 2292, 2020, pp. 40013–40013.
[7] S. S. Yadav, S. M. Jadhav, S. Nagrale, and N. Patil, “Application of Machine Learning for the detection of
heart disease,” 2020 2nd International Conference on Innovative Mechanisms for Industry Applications
(ICIMIA), 2020, pp. 165–172.
[8] V. Patil and U. K. Lilhore, “A survey on different data mining & Machine Learning methods for credit
card fraud detection,” International Journal of Scientific Research in Computer Science, Engineering
and Information Technology, Vol. 3, No. 5, 2018, pp. 320–325.
[9] M. Poongodi and S. Bose, “Detection and Prevention system towards the truth of convergence on
decision using Aumann agreement theorem,” Procedia Computer Science, Vol. 50, 2015, pp. 244–251.
[10] S. Nashif, M. R. Raihan, M. R. Islam, and M. H. Imam, “Heart disease detection by using Machine
Learning algorithms and a real-time cardiovascular health monitoring system,” World Journal of
Engineering and Technology, Vol. 6, No. 4, 2018, pp. 854–873.
[11] N. Pawar, U. K. Lilhore, and N. Agrawal, “A hybrid ACHBDF load balancing method for optimum
resource utilization in cloud computing,” International Journal of Scientific Research in Computer
Science, Engineering and Information Technology, Vol. 3307, 2017, pp. 367–373.
[12] S. Sharma and M. Parmar, “Heart diseases prediction using deep learning neural network model,”
International Journal of Innovative Technology and Exploring Engineering (IJITEE), Vol. 9, No. 3,
2020, pp. 124–137.
[13] D. Singh and J. S. Samagh, “A comprehensive review of heart disease prediction using Machine Learning,”
Journal of Critical Reviews, Vol. 7, No. 12, 2020, pp. 281–285.
[14] K. Guleria, A. Sharma, U. K. Lilhore, and D. Prasad, “Breast Cancer Prediction and Classification Using
Supervised Learning Techniques,” Journal of Computational and Theoretical Nanoscience, Vol. 17, No.
6, 2020, pp. 2519–2522.
[15] U. K. Lilhore, S. Simaiya, K. Guleria, and D. Prasad, “An Efficient Load Balancing Method by Using
Machine Learning-Based VM Distribution and Dynamic Resource Mapping,” Journal of
Computational and Theoretical Nanoscience, Vol. 17, No. 6, 2020, pp. 2545–2551.
[16] S. K. Sharma, U. K. Lilhore, S. Simaiya, and N. K. Trivedi, “An Improved Random Forest Algorithm for
Predicting the COVID-19 Pandemic Patient Health,” Annals of the Romanian Society for Cell Biology,
2021, pp. 67–75.
[17] U. K. Lilhore, S. Simaiya, D. Prasad, and D. K. Verma, “Hybrid Weighted Random Forests Method for
Prediction & Classification of Online Buying Customers,” Journal of Information Technology
Management, Vol. 13, No. 2, 2021, pp. 245–259.
[18] P. S. Kohli and S. Arora, “Application of Machine Learning in disease prediction,” 4th International
conference on computing communication and automation (ICCCA), 2018, pp. 1–4.
[19] A. Ismail, S. Abdlerazek, and I. M. El-Henawy, “Big data analytics in heart diseases prediction,” Journal
of Theoretical and Applied Information Technology, No. 11, 2020, pp. 98–110.
[20] A. N. Repaka, S. D. Ravikanti, and R. G. Franklin, “Design and implementing heart disease prediction
using naives Bayesian,” 2019 3rd International conference on trends in electronics and informatics
(ICOEI), 2019, pp. 292–305.
[21] V. V. Ramalingam, A. Dandapath, and M. K. Raja, “Heart disease prediction using Machine Learning
techniques: a survey,” International Journal of Engineering & Technology, Vol. 7, No. 2, 2018, pp. 684–
687.
[22] “Heart Disease Data Set,” UCI Machine Learning Repository, contains 4 databases, accessed 10 July
2021, https://archive.ics.uci.edu/ml/datasets/heart+Disease, 2021.
[23] P. Singh, S. Singh, and G. S. Pandi-Jain, “Effective heart disease prediction system using data mining
techniques,” International Journal of Nanomedicine, Vol. 13, 2014, pp. 121–128.
[24] J. Patel, D. Tejalupadhyay, and S. Patel, “Heart disease prediction using Machine Learning and data
mining technique,” Heart Disease, Vol. 7, No. 1, 2015, pp. 129–137.
[25] Y. Khourdifi and M. Bahaj, “Heart disease prediction and classification using Machine Learning
algorithms optimized by particle swarm optimization and ant colony optimization,” International
Journal of Intelligent Engineering and Systems, Vol. 12, No. 1, 2019, pp. 242–252.
[26] S. M. Nagarajan, V. Muthukumaran, R. Murugesan, R. B. Joseph, M. Meram, and A. Prathik, “Innovative
feature selection and classification model for heart disease prediction,” Journal of Reliable Intelligent
Environments, 2021, pp. 1–11.
[27] R. T. Selvi and I. Muthulakshmi, “An optimal artificial neural network-based big data application for
heart disease diagnosis and classification model,” Journal of Ambient Intelligence and Humanized
Computing, Vol. 12, No. 6, 2021, pp. 6129–6139.
[28] A. Singh and R. Kumar, “Heart disease prediction using Machine Learning algorithms,” 2020
international conference on electrical and electronics engineering (ICE3), 2020, pp. 452–457.
[29] M. Poongodi, et al., “Prediction of the price of Ethereum blockchain cryptocurrency in an industrial
finance system,” Computers & Electrical Engineering, Vol. 81, 2020, pp. 106527–106539.
[30] M. A. Jabbar, B. L. Deekshatulu, and P. Chandra, “Intelligent heart disease prediction system using
Random Forest and evolutionary approach,” Journal of network and innovative computing, Vol. 4, 2016,
pp. 175–184.
[31] J. Nahar, T. Imam, K. S. Tickle, and Y. P. P. Chen, “Computational intelligence for heart disease diagnosis:
A medical knowledge-driven approach,” Expert Systems with Applications, Vol. 40, No. 1, 2013, pp. 96–
104.
[32] A. M. Sagir and S. Sathasivam, “A Novel Adaptive Neuro-Fuzzy Inference System Based Classification
Model for Heart Disease Prediction,” Pertanika Journal of Science & Technology, No. 1, 2017, pp. 25–30.
[33] M. Poongodi, M. Hamdi, V. Varadarajan, B. S. Rawal, and M. Maode, “Building an Authentic and
Ethical Keyword Search by applying Decentralized (Blockchain) Verification,” IEEE INFOCOM 2020-
IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), 2020, pp. 746–
753.
[34] M. Poongodi, M. M. Hamdi, A. Malviya, G. Sharma, S. Dhiman, and Vimal, “Diagnosis and combating
COVID-19 using wearable Oura smart ring with deep learning methods,” Personal and ubiquitous
computing, 2021, pp. 1–11.
[35] X. Liu, et al., “A hybrid classification system for heart disease diagnosis based on the RFRS method,”
Computational and mathematical methods in medicine, Vol. 24, No. 5, 2017, pp. 214–224.
[36] S. Simaiya, U. K. Lilhore, D. Prasad, and D. K. Verma, “MRI Brain Tumour Detection & Image
Segmentation by Hybrid Hierarchical K-means clustering with FCM based Machine Learning Model,”
Annals of the Romanian Society for Cell Biology, 2021, pp. 88–94.
[37] M. Poongodi and S. Bose, “A novel intrusion detection system based on trust evaluation to defend
against DDoS attack in MANET,” Arabian Journal for Science and Engineering, Vol. 40, No. 12, 2015,
pp. 3583–3594.
[38] M. Poongodi, N. Tu, M. Nguyen, K. Hamdi, and Cengiz, “A Measurement Approach Using Smart-IoT
Based Architecture for Detecting the COVID-19,” Neural Processing Letters, 2021, pp. 1–15.
[39] N. K. Trivedi, S. Simaiya, U. K. Lilhore, and S. K. Sharma, “COVID-19 Pandemic: Role of Machine
Learning & Deep Learning Methods in Diagnosis,” Int J Cur Res Rev, Vol. 13, No. 06, 2021, pp. 150–
156.
[40] S. Dhar, K. Roy, T. Dey, P. Datta, and A. Biswas, “A hybrid Machine Learning approach for
prediction of heart diseases,” 4th International Conference on Computing Communication and
Automation (ICCCA), 2018, pp. 1–6.
[41] A. Garg, U. K. Lilhore, P. Ghosh, D. Prasad, and S. Simaiya, “Machine Learning-based Model for
Prediction of Student’s Performance in Higher Education,” 2021 8th International Conference on Signal
Processing and Integrated Networks (SPIN), 2021, pp. 162–168.
[42] F. Babič, J. Olejár, Z. Vantová, and J. Paralič, “Predictive and descriptive analysis for heart disease
diagnosis,” 2017 federated conference on computer science and information systems (fedcsis), 2017, pp.
155–163.
[43] A. Hassan, D. Prasad, M. Khurana, U. K. Lilhore, and S. Simaiya, “Integration of Internet of Things
(IoT) in Health Care Industry: An Overview of Benefits, Challenges, and Applications,” Data Science and
Innovations for Intelligent Systems, 2021, pp. 165–180.
[44] S. Simaiya, et al., “EEPSA: Energy Efficiency Priority Scheduling Algorithm for Cloud Computing,”
2021 2nd International Conference on Smart Electronics and Communication (ICOSEC), 2021, pp.
1064–1069.