Breast Cancer Diagnostic Using Machine Learning

This thesis applies machine learning algorithms to breast cancer datasets to improve early detection and diagnosis accuracy. It trains decision trees, KNN, SVM, and logistic regression models on the Wisconsin and Coimbra datasets. A comprehensive analysis compares model performance when optimizing hyperparameters and selecting features. Logistic regression and SVM achieved over 99% accuracy, exceeding literature benchmarks. The research establishes machine learning's potential for improved medical diagnostics.

BREAST CANCER DIAGNOSTIC USING MACHINE LEARNING

Applying Supervised Learning Techniques to Coimbra and Wisconsin Datasets

Lappeenranta–Lahti University of Technology LUT


Master’s Programme in Computational Engineering
2023
Vikas Kushwaha
Examiners: Associate Professor Lassi Roininen
Post-Doctoral Researcher Jyrki Savolainen

ABSTRACT

Lappeenranta–Lahti University of Technology LUT


LUT School of Engineering Science
Computational Engineering

Vikas Sudarshan Kushwaha

BREAST CANCER DIAGNOSTIC USING MACHINE LEARNING


Applying Supervised Learning Techniques to Coimbra and Wisconsin Datasets

Master’s thesis
2023
71 pages, 13 figures, 12 tables
Examiners: Associate Professor Lassi Roininen and Post-Doctoral Researcher Jyrki Savolainen

Keywords: Breast cancer diagnosis, Machine learning, Cancer detection, Predictive Modelling

Breast cancer poses a significant global health concern, with approximately 2.2 million new
cases and 700,000 deaths reported in 2020. Traditional diagnostic approaches, which
predominantly depend on expert judgement, have been associated with substantial variability
in accuracy. ML models can help to bridge this gap by improving diagnostics. The present
research investigates the potential of four machine learning algorithms (Decision Trees,
K-Nearest Neighbors, Support Vector Machines, and Logistic Regression) with the
overarching objective of improving early detection and enhancing the precision of breast
cancer diagnosis. The study utilizes the Breast Cancer Coimbra Dataset and the Wisconsin
Diagnostic Breast Cancer Dataset for model training and evaluation. A comprehensive
comparative analysis of these models is conducted, with a focus on optimizing hyperparameters
and distance measures to ascertain the most effective configurations. Further, the influence of
feature selection methods and Principal Component Analysis on model performance is
explored.
The Logistic Regression and Support Vector Machine models demonstrated remarkable
performance, surpassing the predictive accuracy of models reported in current literature, with
accuracies reaching up to 99.42%. This research could serve as a foundation for future studies
applying machine learning models in breast cancer diagnostics, emphasizing the potential of
machine learning as a robust tool in medical diagnostics.

ACKNOWLEDGEMENTS

I want to express my deepest gratitude to my mom, dad, and brothers who have provided
unwavering support throughout my master's journey. Their love and reassurance during times
of doubt have provided the strength I needed to persevere. I can confidently say that this
achievement would not have been possible without them. I am immensely thankful for their
enduring faith in my abilities and for their ceaseless support.

I would also like to thank my supervisors, Jyrki Savolainen and Lassi Roininen, for their
invaluable feedback, which has greatly enhanced the quality and clarity of this work.

ABBREVIATIONS

BC Breast Cancer

ML Machine Learning

BCCD Breast Cancer Coimbra Dataset

WDBC Wisconsin Diagnostic Breast Cancer

KNN K-Nearest Neighbors

SVM Support Vector Machines

LR Logistic Regression

DT Decision Tree

DCIS Ductal Carcinoma in Situ

ASIR Age-Standardized Incidence Rate

HDI Human Development Index

DALY Disability-Adjusted Life Years

HER2 Human Epidermal growth factor Receptor 2

BRCA1 Breast Cancer gene 1

BRCA2 Breast Cancer gene 2

HRT Hormone Replacement Therapy

MIR Mortality-to-Incidence Ratio

PCA Principal Component Analysis

TP True Positive

TN True Negative

FP False Positive

FN False Negative

Table of contents

Abstract
Acknowledgments
Symbols and abbreviations

1. Introduction
1.1. Background and Motivation
1.2. Aim and research question
1.2.1. Significance of study
2. Theoretical Background
2.1. Breast cancer
2.1.1. Breast Cancer Detection and Diagnosis
2.1.2. Types of breast cancer
2.1.3. Cause of Breast cancer
2.1.4. Breast cancer stages
2.1.5. Breast Cancer: Global Patterns
2.2. Machine learning
2.2.1. Machine learning methods
2.3. Algorithms
2.3.1. Support vector machine
2.3.2. K-nearest neighbour (KNN)
2.3.3. Logistic regression
2.3.4. Decision tree
2.4. Assessing Machine Learning Models: The Key Role of Precision, Recall, F1-score, and Accuracy
3. Literature Review
3.1. Previous research on WDBC and BCCD dataset
3.2. Investigating Machine Learning Approaches for Breast Cancer Diagnosis
4. Data and methodology
4.1. Data
4.1.1. Coimbra Breast Cancer Dataset
4.1.2. Wisconsin Breast cancer Dataset
4.2. Experiment setup
4.3. Data Preparation
4.4. Methodology
4.4.1. Hyperparameter Optimization
4.4.2. Visualizing Feature Importance: Insight into Significant Factors for Breast Cancer Diagnosis
4.5. Justification of Machine Learning Models Used for this Evaluation
5. Result
5.1. Performance on ML methods
5.1.1. K-nearest Neighbor
5.1.2. Support Vector Machine
5.1.3. Logistic Regression
5.1.4. Decision tree
5.2. Classifier Comparative analysis
5.3. Comparison with present literature
6. Conclusion and Discussion
6.1. Answering the research questions
6.2. Verification and Validation
6.3. Future research

Figures

Figure 1: Methodology Process: PCA, Hyperparameter Tuning, and Outcomes Visualization

Figure 2: Feature importance of BCCD dataset

Figure 3: Precision, Recall and F1 score with KNN algorithm on BCCD

Figure 4: Confusion Matrix for KNN on BCCD Dataset

Figure 5: Confusion Matrix for KNN with PCA on WDBC Dataset

Figure 6: WDBC dataset with PCA performance comparison of different SVM kernel

Figure 7: BCCD dataset performance comparison with different SVM kernel

Figure 8: Confusion Matrix for SVM on WDBC Dataset

Figure 9: Confusion Matrix for SVM on BCCD Dataset

Figure 10: Confusion Matrix for LR on BCCD Dataset

Figure 11: Confusion Matrix for LR on WDBC Dataset

Figure 12: Confusion Matrix for Decision tree on BCCD Dataset

Figure 13: Confusion Matrix for Decision tree on WDBC Dataset

Tables

Table 1: Stages of breast cancer

Table 2: Performance of Machine Learning Models on Breast Cancer Datasets Utilized in
Our Study

Table 3: Overview of Breast Cancer Studies Utilizing Machine Learning Techniques

Table 4: Overview of Patient health features BCCD dataset

Table 5: Comparison of KNN Accuracy on BCCD Dataset Using Different Distance

Table 6: LR Model Performance with different Hyperparameters



Table 7: Decision Tree Model Performance at Various Max Depths on BCCD Dataset

Table 8: Decision Tree Model Accuracy at Various Max Depths on WDBC Dataset

Table 9: Comparison of Machine Learning Model Accuracies on BCCD Dataset Before and
after Feature Elimination

Table 10: Model Performance: Assessing Accuracy Across Machine Learning Models on
WDBC Dataset

Table 11: Comparative Analysis of Machine Learning Model Performance with Previous
Studies

1. Introduction

In the last decade, there has been a surge of interest in the field of Machine learning (ML),
driven by several factors including lower prices for processing time and storage space. This
has facilitated the development of advanced ML models, such as reinforcement and deep
learning, enabling the efficient archiving, processing, and analysis of massive datasets.
Machine learning has played a crucial role in areas like data mining, natural language
processing, image recognition, expert systems, and prediction (Maity & Das, 2017). The
primary objective of this thesis is to develop an accurate and efficient machine learning model
for Breast Cancer (BC) detection.

BC is a significant global health concern, with an estimated 2.2 million new cases and 0.7
million deaths in 2020, making it the second leading cause of cancer-related fatalities among
women worldwide (Sung et al., 2021). Siegel et al. (2022) anticipated 43,600 deaths among
women and 0.3 million newly diagnosed cases of BC in the U.S. in 2021. Tumours can be
either benign or malignant. Benign tumours may raise the risk of developing BC, yet they are
neither life-threatening nor cancerous. Malignant tumours, on the other hand, are a more
significant cause for concern because they are cancerous. According to one study,
twenty per cent of women with BC die from aggressive tumours (Subashini et al., 2009).

Tumour diagnosis has been a focus of previous studies. Scientists are using Machine learning
(ML) and Data Mining (DM) to predict BC (Abdar et al., 2020). Improved accuracy and
throughput in cancer diagnosis are possible using classifier-based prediction models built on
ML and DM. DM is a broad collection of methods for mining massive, complex
datasets for previously undiscovered knowledge and insights. It has seen extensive application
in the development of disease prediction systems (McWilliam et al., 2016), including those for
thyroid cancer (Rasool et al., 2020) and cardiovascular disease (Park et al., 2021). DM and ML
approaches for BC diagnosis have also been incorporated into fuzzy-genetic methods
(Bicchierai et al., 2021) and computer-assisted systems (Kim et al., 2021).

1.1. Background and Motivation

Cancer, a disease marked by rapid, aggressive cell division, spreads to nearby organs and
tissues. DNA aberrations trigger this devastating condition. Changes typically affect larger
DNA segments called genes. The term "cancer" encompasses various types, each characterized
by uncontrolled abnormal cell growth in one or multiple organs. These rogue cells often invade
adjacent body parts and spread to other organs. BC originates from cells in the breast,
particularly those in the inner lining of milk ducts or the milk-producing lobules. These changes
or mutations may occur spontaneously due to increasing entropy or be triggered by external
factors. Various environmental stressors contribute to these alterations, including radiation
(microwaves, gamma rays, X-rays, ultraviolet rays) and chemicals found in food, water, and air,
as well as evolution and the aging of RNA and DNA (Leão et al., 2021).

Cancer research has made significant strides in recent years, with machine learning (ML)
playing a crucial role in advancing diagnostic methods. However, several researchers face
challenges with ML classifier accuracy due to the absence of fundamental methodologies.
Confusion matrices in some studies show mispredicted false negatives and true negatives,
leading to the incorrect classification of cases. Another issue arises when feature training is
combined with nonlinear classification, as the model's execution time increases exponentially
with the addition of more features, ultimately affecting diagnosis accuracy. Both data analysts
and medical professionals are deeply concerned about the model's accuracy and its time
complexity. Given the above challenges, this research is motivated by the necessity to improve
the effectiveness of BC diagnostics using machine learning. The goal is to propose data mining
strategies using various machine learning models to identify the most effective ML model for
predicting BC diagnosis.

1.2. Aim and research question

The purpose of the study is to employ ML algorithms to test existing prediction models for
processing a large number of tumour features and extracting relevant data for BC analysis.
The learning goals include using data mining methods to find a solid cancer classification
prediction model. The analysis further examines the selected machine learning
models by varying the values of their respective hyperparameters. The main aim of this thesis
is to evaluate and optimize machine learning models to improve BC diagnostic accuracy.
This research uses four ML models: Decision Tree (DT), K-Nearest Neighbors (KNN),
Support Vector Machine (SVM), and Logistic Regression (LR). It proposes data exploratory
techniques (DET), creates four separate predictive models, and determines which
machine learning model produces the best performance on the Breast Cancer Coimbra
Dataset (BCCD) and the Wisconsin Diagnostic Breast Cancer (WDBC) dataset. The research
questions are as follows:

1. How and which machine learning models are utilized to detect cancer in patient data
according to the literature?

2. Among Support vector machine (SVM), K-Nearest Neighbors (KNN), Decision tree (DT),
and Logistic regression (LR), which Machine learning (ML) model demonstrates the best
prediction performance when applied to the Breast cancer Coimbra dataset (BCCD) and
Wisconsin diagnostic breast cancer dataset (WDBC)?

1.2.1. Significance of study

Traditional diagnostic methods such as mammography and biopsies are useful but
time-consuming and expensive. They are not readily available in underserved areas where it is
challenging to provide and maintain such high-cost equipment. There is an increasing demand
for alternative approaches that are cost-effective and efficient. Machine learning and computer
vision technologies have the potential to assist with this challenge in the medical field.
Because survival rates can be increased with early diagnosis, cancer prognosis and detection
are of the utmost importance. Because of their ability to detect complicated patterns in data,
machine learning-based technologies have emerged as promising solutions in this domain,
potentially outperforming older methods. The goal of this study is to improve machine learning
algorithms for diagnosing BC utilizing the BCCD and WDBC datasets in order to increase
diagnostic accuracy. This study could serve as a benchmark for future studies in this field.

2. Theoretical Background

The theoretical background of this research encompasses the understanding of BC biology,
causes of BC, feature extraction and the selection and optimization of machine learning models.

2.1. Breast cancer

The breast is generally accepted to be the most common cancer site in women. Cancer
of the breast occurs when a large number of breast cells mutate (change) and proliferate in an
uncontrolled manner, forming a tumour. Like many other types of cancer, BC can
spread to the lymph nodes and to other structures of the chest. It can also spread to
more distant parts of the body, where it can create new tumours. This is referred to as
metastasis. Aside from skin cancer, BC ranks as one of the most frequent cancers among
women. Prevalence increases after age 50 in females. Although men are not immune to
the disease, females are more likely to contract it. About 2,600 men are diagnosed with male
BC every year in the U.S., making up less than 1% of total cancer cases (Prague et al., 2023).
BC is more common in trans women than in cis men, whereas the incidence of BC among
trans men is lower than among cisgender women. Women aged 40 and up are more likely
to acquire BC, which develops when cells in the milk-producing glands (called lobules)
become aberrant and divide rapidly (de Blok et al., 2019).

According to the World Health Organization, between 8 and 9 percent of women worldwide are
given a BC diagnosis each year, and its underlying cause has yet to be fully identified.
Nonetheless, there are a number of known risk factors that are thought to increase
the risk of getting BC in women. These include dietary habits, alcohol consumption, being
female, smoking, having dense breasts, not getting enough exercise, a history of
pregnancy, family history, genetics, breastfeeding, ethnicity, life history, menstrual history,
body mass index, breast changes, and a past history of BC. The most common
symptoms of BC include dry, flaky skin on the breast or nipple, itchy skin, dimpled skin,
redness, a change in breast size or shape, breast thickening in patches, and whole or partial
swelling (Sharma, 2021).

2.1.1. Breast Cancer Detection and Diagnosis

Computer-aided prediction and treatment of BC necessitate several intermediate steps,
including lesion identification, feature recognition, and finally, classification of detected
regions. Breast lesion detection can be achieved either by pixel-by-pixel delineation of a
suspicious region in a breast image or by creating a bounding box around the questionable area
(Esteva et al., 2017). An alternative approach involves processing the entire image for cancer
detection instead of isolating and classifying potentially cancerous regions from breast scans,
which could entail additional costs.

During the processing, features are extracted from the entire image, with emphasis on regions
showing abnormalities or potential lesions (Antropova et al., 2017). These features encapsulate
valuable information about the structure and morphology of breast tissues, thereby offering
insights into the presence or absence of cancerous cells. The extracted features may include
shape-based, texture-based, and edge-based attributes. Each of these features contributes
distinct information to the diagnostic process and is essential for the precise and accurate
diagnosis of BC (Dhungel et al.,
2017). Once the relevant features have been extracted, they are fed into a classifier (a machine
learning algorithm) that will predict whether the examined tissue is cancerous or not,
completing the process of computer-aided BC detection and diagnosis (Dhungel et al., 2017).

2.1.2. Types of breast cancer

BC comes in a wide variety of forms. The disease is classified according to the type of breast
cell affected. Carcinomas make up the vast majority of cases of BC (American Cancer Society,
2021). Adenocarcinomas arise in the gland cells that line the milk ducts and the lobules
(milk-producing glands), making them the most prevalent type of BC. Other malignancies,
such as angiosarcoma and sarcoma, can also develop in the breast; because they begin in
different cells, they are not technically considered BC (Elanany et al., 2023).

It is also possible to categorise breast tumours based on the proteins and genes they express.
Following a biopsy, BC cells are analysed to determine whether or not they have the HER2 protein or
gene, as well as the oestrogen receptor and progesterone receptor proteins (Ross et al., 2009).
The tumour cells are examined in great detail in the lab to determine the tumour’s grade.
Treatment options and cancer stage are often determined by the types of proteins detected and
the severity of the tumour.

2.1.3. Cause of Breast cancer

The multiplication and spread of aberrant breast cells are critical steps in developing BC.
However, specialists are still unsure of the initial trigger for this phenomenon. Research
has indicated a multitude of potential risk factors that could increase a woman's susceptibility
to developing BC (Mahmood, 2023). Age
is a critical risk factor for BC, particularly for women over the age of 55. As a woman grows
older, her breast tissue becomes more vulnerable to damage and mutation, increasing the
likelihood of cancerous growths. One prominent influence is the hormonal changes that occur
during menopause, which can potentially stimulate the development of BC. Moreover, sex is a
critical determinant in BC incidence. The disease affects women significantly more than men,
which is attributed to the greater oestrogen exposure in women that stimulates the growth of
breast cells (American Cancer Society, 2021).

Family history and genetics substantially contribute to the risk factors associated with BC. An
elevated risk is observed in women whose close relatives, such as mother, sister, or child, have
been diagnosed with BC. Certain genetic mutations, notably BRCA1 and BRCA2, while
relatively rare and accounting for only 5-10% of all cases, are linked to a significant increase
in BC risk. In addition to genetic factors, several lifestyle habits are associated with increased
BC risk. Smoking and alcohol consumption have been consistently associated with higher BC
incidence rates. Obesity represents another critical risk factor, given the ability of adipose cells
to produce oestrogen, a hormone known to stimulate the growth of breast cells (Mahmood,
2023). Radiation exposure, particularly to the head, neck, or chest regions, can also elevate BC
risk. Such radiation has the potential to damage the DNA in breast cells, triggering mutations
and precipitating abnormal growth. HRT, in which synthetic hormones substitute for naturally
occurring ones, has been implicated in elevated BC risk. Data suggest that women utilizing
HRT are at a higher risk of BC compared to those who refrain from such therapy (Mahmood,
2023). These findings highlight the multifaceted nature of BC risk, encompassing genetic,
lifestyle, and environmental factors.

2.1.4. Breast cancer stages

BC is classified into stages according to the size of the tumour and the extent to which it has
spread. Cancers that have spread beyond the breast or are particularly big are considered to be
at a more advanced stage than those still localised to the breast. It is essential for doctors to
determine whether or not a BC patient has an invasive form of the disease before deciding how
to treat it (Jia et al., 2023). Staging takes into account how far the cancer has spread,
whether or not lymph nodes are involved, and the size of the tumour. BC can be divided into
five distinct stages, labelled from 0 to 4 (Edge & Compton, 2010).

Table 1. Stages of breast cancer (Mahmood, 2023)

Stage      Description
Stage 0    DCIS: cancer cells have not spread beyond the ducts.
Stage 1A   Tumor ≤ 2 cm, no lymph node involvement.
Stage 1B   Lymph nodes test positive, tumor ≤ 2 cm.
Stage 2A   Tumor ≤ 2 cm with 1-3 lymph nodes affected OR tumor > 5 cm without spread.
Stage 3A   Swelling of 4-9 axillary lymph nodes OR lymph nodes inside the breast, regardless of tumor size; OR tumor > 5 cm with 1-3 lymph nodes affected (including breastbone lymph nodes).
Stage 3B   Cancer spreads to the chest wall, possibly affecting up to 9 lymph nodes.
Stage 3C   At least 10 lymph nodes affected in the armpit, under the collarbone, or in the breast.
Stage 4    Metastatic BC: tumor of any size, cancer cells spread to nearby and distant lymph nodes and other organs.

2.1.5. Breast Cancer: Global Patterns

According to estimates, cancerous tumours are the biggest cause of disability in women all
over the world, accounting for 107.8 million disability-adjusted life years (DALYs), of which
BC is responsible for 19.6 million. The percentage of female
malignancies attributable to BC in the U.S. is projected to rise to 29% by the year 2040 (Alom
et al., 2023). The HDI has a positive and statistically significant association with the ASIR
for BC, as shown by the most recent information from GLOBOCAN. According to the data
from the year 2020, nations with the highest HDI had the highest ASIR (75.8 per 100,000),
whilst nations with a medium or low HDI had an ASIR that was less than half of that (27.8 and
36.1, respectively). There were 0.7 million fatalities [95% UI, 0.7 million] from BC
worldwide, at an estimated rate of 13.6 per 100,000 people of reproductive age. Although
mortality rates are highest in industrialised regions, 63% of all fatalities in 2020 were in Asia
and Africa. A woman who receives a positive BC diagnosis in a high-income country has a
much better chance of survival than one in a low-income or middle-income country
(Ferlay et al., 2020).

The mortality-to-incidence ratio (MIR) for BC in the year 2020 was 0.30, which is indicative
of 5-year survival rates (Łukasiewicz et al., 2021). In nations with modern healthcare systems
such as Hong Kong, Singapore, and Turkey, the five-year survival rate for BC is 89.6%
for cancer that has been localised but only 75.4% for cancer that has spread throughout the
body. In high-income countries, the five-year BC survival rate was 76.3%, whereas in low-income
countries it was only 47.4% (India, Philippines, Thailand, Costa Rica, and Saudi Arabia)
(Łukasiewicz et al., 2021).
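
As a rough illustration of how an MIR is obtained, the short sketch below divides the number of deaths by the number of new cases for the same period. It uses the rounded 2020 figures quoted in the Introduction (2.2 million new cases, 0.7 million deaths), so the result is only approximately equal to the 0.30 reported above; the difference is assumed here to come from rounding of the inputs. The function name `mortality_to_incidence_ratio` is hypothetical and chosen for illustration only.

```python
def mortality_to_incidence_ratio(deaths, new_cases):
    """Return deaths divided by newly diagnosed cases for the same period."""
    if new_cases <= 0:
        raise ValueError("new_cases must be positive")
    return deaths / new_cases


# Rounded worldwide 2020 figures quoted in the Introduction (Sung et al., 2021).
deaths_2020 = 0.7e6
new_cases_2020 = 2.2e6

mir = mortality_to_incidence_ratio(deaths_2020, new_cases_2020)
print(f"MIR = {mir:.2f}")  # about 0.32 with these rounded inputs
```

A lower MIR means fewer deaths per diagnosed case, which is why the ratio is read as an indicator of survival.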

2.2. Machine learning

Machine learning is a branch of both AI and computer science that seeks to improve upon
earlier AI systems by imitating how humans learn, using data and algorithms to do so.
Over the past few decades, storage and processing improvements have made
possible novel machine learning-based products, for instance Netflix's recommendation engine
and self-driving cars (Bell, 2022). There has been a recent uptick in the popularity of data
science, of which machine learning is an integral aspect. In data mining projects, statistical
approaches are used to teach computers to sort data into categories, make predictions, and
unearth previously unknown relationships. Applications and businesses can then make better
decisions based on these findings, which should positively affect critical growth KPIs.

2.2.1. Machine learning methods

Machine learning encompasses various methods for analyzing and modeling data, with three
key aspects being semi-supervised learning, unsupervised learning, and supervised learning.

Supervised learning learns from input data with known outputs in order to predict the output
for new inputs, and has two main sub-types: classification and regression (Zhou, 2018).
Classification involves assigning data to pre-defined classes, while regression is concerned
with predicting a continuous value from the characteristics of the data. In contrast,
unsupervised learning does not require predetermined target outcomes. It aims to uncover the
relationships and connections within the data during the learning process. Unsupervised
learning doesn't rely on labelled training data and includes clustering and association as its
primary forms (Glielmo et al., 2021). Clustering identifies related groups when the data's
inherent groupings are unknown, and
association discovers relationships and connections between items within the dataset.
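
The difference between classification and clustering can be sketched on a toy example. The snippet below is a minimal illustration in plain Python, not the method used in this thesis: the data values, the labels, and the helper names `fit_threshold` and `two_means` are all hypothetical. The supervised part uses the labels to place a decision threshold between the class means, while the unsupervised part groups the same values into two clusters without ever seeing the labels.

```python
from statistics import mean

# Toy 1-D feature values (hypothetical, e.g. a scaled tumour measurement).
values = [1.0, 1.2, 0.8, 3.9, 4.1, 4.3]
labels = [0, 0, 0, 1, 1, 1]  # known classes, used only by the supervised part


def fit_threshold(xs, ys):
    """Supervised: use the labels to place a boundary midway between class means."""
    mean0 = mean(x for x, y in zip(xs, ys) if y == 0)
    mean1 = mean(x for x, y in zip(xs, ys) if y == 1)
    return (mean0 + mean1) / 2


def two_means(xs, iterations=10):
    """Unsupervised: split the values into two groups without using any labels."""
    centres = [min(xs), max(xs)]  # simple deterministic initialisation
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for x in xs:
            nearest = 0 if abs(x - centres[0]) <= abs(x - centres[1]) else 1
            groups[nearest].append(x)
        # Move each centre to the mean of its group (keep it if the group is empty).
        centres = [mean(g) if g else c for g, c in zip(groups, centres)]
    return centres, groups


threshold = fit_threshold(values, labels)
predicted = [int(x > threshold) for x in values]  # classification of each value
centres, groups = two_means(values)               # clustering of the same values

print("threshold:", threshold, "predicted:", predicted)
print("cluster centres:", centres, "groups:", groups)
```

On this well-separated toy data the two approaches recover the same grouping; on real data the clusters found without labels need not coincide with the diagnostic classes.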

Semi-supervised learning is employed when there is a smaller amount of labelled data
compared to unlabelled data. In such scenarios, both supervised and unsupervised learning
might be inadequate. Semi-supervised learning uses a limited amount of labelled data to derive
more information from the unlabelled data, which is considered a form of learning under
supervision. While unsupervised
learning can function without a labelled dataset, semi-supervised learning necessitates some
labelled data. However, the quantity of labelled data in semi-supervised learning is less than
the data to be predicted (Pise and Kulkarni, 2008).

2.3. Algorithms

In this study, four machine learning algorithms are utilized to investigate BC diagnostics using
the BCCD and WDBC datasets. The selected algorithms (KNN, SVM, DT, and LR) are chosen based on
their variety of approaches, interpretability, and ease of use. The following subsections provide
a detailed description of each algorithm, along with its suitability for the current study.

2.3.1. Support vector machine

SVM is a supervised ML algorithm commonly used for classification problems. It is effective in
high-dimensional spaces and, because it maximizes the margin between classes, tends to generalize
well to new data points. The SVM finds a hyperplane that separates data points of different
classes. Hard-margin SVM handles linearly separable data, while soft-margin SVM tolerates data
that are not perfectly separable (Hastie, Tibshirani, & Friedman, 2009). The kernel trick
additionally handles non-linear class boundaries by mapping the data into a higher dimension
where linear separability can be attainable (Schölkopf & Smola, 2002).

Mathematically, a hyperplane is represented by the equation:

wx + b = 0 (1)

where w is the weight vector, x represents a data point, and b is the bias term.

In cases where the data are linearly separable, the optimization problem for the hard-margin SVM is:

$$\min_{w,\,b} \; \frac{1}{2}\lVert w \rVert^2$$

$$\text{subject to } y_i\,(w \cdot x_i + b) \ge 1, \quad i = 1, 2, \ldots, N$$

where y_i denotes the class label of the data point x_i, and N is the total number of data points.

However, real-world data are often not linearly separable. To accommodate such cases, the soft
margin SVM is introduced. This approach allows for some misclassifications by incorporating
slack variables (ξ_i) and a regularization parameter (C). The soft margin SVM is expressed as:

$$\min_{w,\,b,\,\xi} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{N} \xi_i$$

$$\text{subject to } y_i\,(w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, 2, \ldots, N$$

The regularization parameter, C, manages the trade-off between maximizing the margin and
reducing the classification error. A smaller C value allows a wider margin with more
misclassifications, whereas a larger C value penalizes misclassifications more heavily,
resulting in a narrower margin.
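The role of C can be illustrated with a minimal sketch that evaluates the soft-margin objective for a fixed hyperplane. The toy points, the weight vector, and the helper names `dot` and `soft_margin_objective` are hypothetical and chosen only for illustration; the sketch does not train an SVM, it only shows how each slack variable and the penalty term are computed.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def soft_margin_objective(w, b, X, y, C):
    """Return (objective, slacks) for a fixed hyperplane (w, b).

    Each slack is the smallest xi_i >= 0 that satisfies
    y_i * (w . x_i + b) >= 1 - xi_i.
    """
    slacks = [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]
    objective = 0.5 * dot(w, w) + C * sum(slacks)
    return objective, slacks


# Hypothetical toy data: the third point lies on the wrong side of the hyperplane.
X = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]
y = [1, -1, -1]
w, b = (1.0, 0.0), 0.0

for C in (0.1, 1.0, 10.0):
    objective, slacks = soft_margin_objective(w, b, X, y, C)
    print(f"C={C}: slacks={slacks}, objective={objective}")
```

For the same hyperplane, the penalty contributed by the misclassified point grows linearly with C, which is why a large C pushes the optimizer towards fewer margin violations.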

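The kernel trick can be applied to classify data that are not linearly separable.

Popular kernel functions include:

Linear kernel: $$K(x, y) = x \cdot y \tag{2}$$

Polynomial kernel: $$K(x, y) = (\gamma\, x \cdot y + r)^d \tag{3}$$

where γ is a scaling factor, r is a constant offset, and d is the degree of the polynomial.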

Kernel functions are used in SVM to transform data into higher-dimensional spaces, which
makes it possible to find a hyperplane that separates the data in cases where the data is not
linearly separable in the original space (Schölkopf & Smola, 2002).
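As a small illustration of kernels (2) and (3), the sketch below evaluates both and checks that, for two-dimensional inputs with γ = 1, r = 1, and d = 2, the polynomial kernel equals an ordinary inner product in an explicitly constructed six-dimensional space. The function names and the sample vectors are assumptions made for this example.

```python
import math


def linear_kernel(x, y):
    """Equation (2): plain inner product."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, gamma=1.0, r=1.0, d=2):
    """Equation (3): (gamma * x . y + r) ** d."""
    return (gamma * linear_kernel(x, y) + r) ** d


def explicit_quadratic_map(x):
    """Feature map for 2-D input whose inner product reproduces the
    polynomial kernel with gamma=1, r=1, d=2."""
    x1, x2 = x
    s = math.sqrt(2.0)
    return (x1 * x1, x2 * x2, s * x1 * x2, s * x1, s * x2, 1.0)


u, v = (1.0, 2.0), (3.0, 4.0)
k_implicit = polynomial_kernel(u, v)
k_explicit = linear_kernel(explicit_quadratic_map(u), explicit_quadratic_map(v))
print(k_implicit, k_explicit)  # the two values agree up to rounding
```

The kernel computes the higher-dimensional inner product without ever building the mapped vectors, which is what makes the approach practical when the mapped space is very large.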

2.3.2. K-nearest neighbour (KNN)

The K-Nearest Neighbor (KNN) algorithm is a type of instance-based learning algorithm that
classifies a given query point based on the majority class of its 'K' nearest data points in the
feature space. This makes KNN a 'lazy' learning algorithm, as it does not build a model from
the training data but instead uses the data points themselves for prediction (Hastie, Tibshirani,
& Friedman, 2009). The KNN algorithm can classify the presence of BC based on the
proximity of other data points with similar features. To do this, KNN uses various distance
metrics such as Euclidean, Manhattan, Chebyshev, and Minkowski to calculate the distance
between data points (James, Witten, Hastie, & Tibshirani, 2013).

Euclidean: $$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{4}$$

Manhattan: $$d(x, y) = \sum_{i=1}^{n} |x_i - y_i| \tag{5}$$

Minkowski: $$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} \tag{6}$$

The equations presented denote different distance metrics used in machine learning: the
Euclidean, Manhattan, and Minkowski distances. The symbol 'd(x,y)' signifies the distance
between two data points x and y, with 'x_i' and 'y_i' indicating individual elements of these
data points. The variable 'n' stands for the count of elements in each data point, and 'p' is a
parameter that dictates the specific distance metric applied. When 'p' equals 2, the Euclidean
distance is the result, and when 'p' equals 1, the outcome is the Manhattan distance. The
selection of an appropriate distance metric, along with the choice of the 'K' value, can greatly
influence the performance of the KNN algorithm. Additionally, it is important to note that KNN can be
hindered by the 'curse of dimensionality' in scenarios involving high-dimensional data.
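The relationship between the three distances and the majority vote can be sketched in a few lines. The training points, labels, and the function names `minkowski` and `knn_predict` are hypothetical; the sketch is a plain illustration of equation (6) and of majority voting, not the implementation used in this study.

```python
from collections import Counter


def minkowski(x, y, p=2):
    """Equation (6); p=1 gives Manhattan distance, p=2 gives Euclidean distance."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def knn_predict(train_X, train_y, query, k=3, p=2):
    """Label the query with the majority class of its k nearest training points."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: minkowski(pair[0], query, p))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature toy data, for illustration only.
train_X = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_y = ["benign"] * 3 + ["malignant"] * 3

print(minkowski((0, 0), (3, 4), p=2))  # Euclidean distance
print(minkowski((0, 0), (3, 4), p=1))  # Manhattan distance
print(knn_predict(train_X, train_y, (2, 2)))
print(knn_predict(train_X, train_y, (6, 5)))
```

An odd k is used here so that a two-class vote cannot end in a tie.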

2.3.3. Logistic regression

Logistic regression is a supervised learning algorithm used primarily for binary classification
problems. It predicts the probability of an instance, represented by the input variables,
belonging to the default class (Y=1). Thresholding this probability yields a binary prediction,
which, in the context of this study, corresponds to the presence or absence of
BC based on the input features from the datasets. The logistic function, also known as the
sigmoid function, maps any real-valued number into a value between 0 and 1, which can then
be interpreted as the predicted probability (James, Witten, Hastie, & Tibshirani, 2013). The
logistic function is represented as follows:

$$P(x) = \frac{1}{1 + e^{-x}} \tag{7}$$

The odds are given by:

$$\mathrm{Odds}(x) = \frac{P(x)}{1 - P(x)} \tag{8}$$

This can also be expressed through the logit function:

$$\mathrm{logit}(x) = \ln(\mathrm{Odds}(x)) = \ln\left(\frac{P(x)}{1 - P(x)}\right) \tag{9}$$



The coefficients (β_0, β_1,..., β_n) in the LR model are estimated from the training data using
the method of maximum likelihood (Hosmer Jr et al., 2013). The LR model can thus be
expressed as follows:

$$P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \ldots + \beta_n X_n)}} \tag{10}$$

This implies that the probability of Y=1, given the input variables X, can be computed by
applying the sigmoid function to the linear combination of the input features and their
respective coefficients.
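Equations (7) to (10) can be traced numerically with a short sketch. The coefficient values and feature vector below are hypothetical placeholders rather than fitted parameters, and the function names are chosen for this example only.

```python
import math


def sigmoid(z):
    """Equation (7): maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def odds(p):
    """Equation (8): odds corresponding to a probability p."""
    return p / (1.0 - p)


def logit(p):
    """Equation (9): natural log of the odds; the inverse of the sigmoid."""
    return math.log(odds(p))


def predict_proba(x, beta0, betas):
    """Equation (10): P(Y=1 | X) for a single feature vector x."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return sigmoid(z)


# Hypothetical coefficients and features, not estimated from any dataset.
beta0, betas = -1.0, (0.8, -0.3)
x = (2.0, 1.0)

p = predict_proba(x, beta0, betas)
print(p, odds(p), logit(p))  # logit(p) recovers the linear combination z
```

Because the logit is the inverse of the sigmoid, each coefficient can be read as the change in log-odds for a one-unit change in the corresponding feature.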

2.3.4. Decision tree

The DT algorithm, utilized in both classification and regression problems, operates by
recursively dividing the data into subsets based on input feature values. The ultimate objective
is to construct a model that accurately predicts the target variable from a given set of input
features. Two prevalent criteria for selecting the most suitable feature on which to split the
data are information gain and Gini impurity (Quinlan, 1986). Information gain measures the
decrease in the entropy, or uncertainty, of the dataset after splitting on a specific feature.
This quantity is obtained by subtracting the weighted average entropy after the split from the
pre-split entropy (James, Witten, Hastie, & Tibshirani, 2013). The entropy (H) of a dataset is
formulated as:

$$H(S) = -\sum_{c} p(c)\,\log_2 p(c) \tag{11}$$

The information gain (IG) is calculated with the following formula:

$$IG(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|}\, H(S_v) \tag{12}$$

At each node in the DT, the feature providing the highest information gain is chosen as the
splitting criterion. On the other hand, Gini impurity measures how frequently a randomly
chosen instance from the dataset would be mislabelled if it were labelled randomly according
to the distribution of class labels in the dataset. The Gini impurity is computed as follows:

$$Gini(S) = 1 - \sum_{c} p(c)^2 \tag{13}$$



To identify the best feature for splitting, the Gini impurity index is calculated for each feature.
This is done by considering the weighted sum of Gini impurities for each subset produced by
the split:

$$Gini\_Index(S, A) = \sum_{v} \frac{|S_v|}{|S|}\, Gini(S_v) \tag{14}$$

The feature with the lowest Gini index is chosen as the splitting criterion at each node in the
DT. In these formulas, "S" symbolizes the dataset, while "c" represents each class within the
dataset. The term "p(c)" indicates the proportion of instances belonging to class "c".
Simultaneously, "A" denotes the feature under consideration for splitting the dataset into
subsets 'S_v', each containing instances that share the value 'v' for the feature 'A'. Lastly, "|S_v|"
and "|S|" represent the quantity of instances in subsets "S_v" and "S", respectively (Quinlan,
1986).
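Equations (11) to (14) can be checked on a tiny example. The sketch below uses a hypothetical four-row dataset with two categorical features; the feature names, values, and function names are assumptions made purely for illustration.

```python
import math
from collections import Counter


def class_proportions(labels):
    """p(c) for every class c present in the labels."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def entropy(labels):
    """Equation (11)."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


def gini(labels):
    """Equation (13)."""
    return 1.0 - sum(p ** 2 for p in class_proportions(labels))


def split_labels(rows, labels, feature):
    """Group labels into subsets S_v by the value v of the given feature."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[feature], []).append(label)
    return subsets


def information_gain(rows, labels, feature):
    """Equation (12): entropy before the split minus weighted entropy after."""
    n = len(labels)
    subsets = split_labels(rows, labels, feature)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets.values())


def gini_index(rows, labels, feature):
    """Equation (14): weighted sum of Gini impurities of the subsets."""
    n = len(labels)
    subsets = split_labels(rows, labels, feature)
    return sum(len(s) / n * gini(s) for s in subsets.values())


# Hypothetical data: "size" separates the classes perfectly, "shape" not at all.
rows = [
    {"size": "large", "shape": "irregular"},
    {"size": "large", "shape": "regular"},
    {"size": "small", "shape": "irregular"},
    {"size": "small", "shape": "regular"},
]
labels = ["M", "M", "B", "B"]

for feature in ("size", "shape"):
    print(feature, information_gain(rows, labels, feature), gini_index(rows, labels, feature))
```

Both criteria agree on this example: the feature with the highest information gain is also the one with the lowest Gini index.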

2.4. Assessing Machine Learning Models: The Key Role of Precision, Recall, F1-score, and Accuracy

Performance analysis is undertaken to assess a model and determine its effectiveness (James et
al., 2013). The analysis allows us to pinpoint areas of the model that require improvement,
to evaluate its efficacy, and to ensure its reliability (Hastie et al., 2009). The performance
evaluation is conducted through various indicators, including precision, recall, accuracy, and
the F1-score. Accuracy refers to the proportion of correctly classified instances relative to the
overall number of classifications. It is a useful metric when the class distribution is
balanced, i.e., there is an equal or near-equal number of samples in each class (James et al.,
2013). However, with imbalanced datasets, accuracy may not provide a comprehensive
evaluation of the model's performance, as it can be heavily influenced by the majority class:
a model could predict the majority class every time and still achieve high accuracy, leading to
misleading results (Hastie et al., 2009). Precision is an effective measure when the cost of
false positives is high, whereas recall is the more informative measure when the cost of false
negatives is high, as in cancer detection.

In many cases, there is a trade-off between precision and recall, where optimizing one may
lead to the reduction of the other. To balance these metrics, the F1-score is often used, which
is the harmonic mean of precision and recall. The F1-score provides a single metric that
considers both precision and recall, making it useful whenever both false positives and false
negatives matter, and not only in the case of imbalanced datasets (Saito & Rehmsmeier, 2015).

Precision

$$Precision = \frac{TP}{TP + FP} \tag{15}$$

Recall

$$Recall = \frac{TP}{TP + FN} \tag{16}$$

F1-score

$$F1\text{-}score = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \tag{17}$$

Accuracy

$$Accuracy = \frac{TP + TN}{TP + FP + FN + TN} \tag{18}$$

True Positive (TP) and True Negative (TN) are instances correctly classified by a model,
denoting actual positives and negatives, respectively. False Positive (FP) and False Negative
(FN) denote incorrect classifications, representing negatives classified as positives and positives
classified as negatives, respectively. Choosing the appropriate performance metrics for
evaluating a machine learning model requires a detailed understanding of these measures. This
choice must also consider the unique attributes of the dataset and the specific objective of the
task.
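Equations (15) to (18) can be computed directly from the four confusion-matrix counts. In the sketch below the counts are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed for this example rather than part of the definitions.

```python
def safe_divide(numerator, denominator):
    """Return 0.0 instead of raising when the denominator is zero (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1-score, and accuracy from confusion-matrix counts."""
    precision = safe_divide(tp, tp + fp)
    recall = safe_divide(tp, tp + fn)
    f1 = safe_divide(2 * precision * recall, precision + recall)
    accuracy = safe_divide(tp + tn, tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical balanced case: 50 positives and 50 negatives.
print(classification_metrics(tp=40, fp=10, fn=10, tn=40))

# Hypothetical imbalanced case: 5 positives, 95 negatives, and a model that
# always predicts the majority (negative) class.
print(classification_metrics(tp=0, fp=0, fn=5, tn=95))
```

In the second case accuracy is high while recall is zero, which is the situation in which accuracy alone gives a misleading picture of a diagnostic model.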

3. Literature Review

The exploration of ML in breast cancer diagnosis forms a significant portion of research. A
plethora of studies have critically examined and contrasted the efficacy of diverse machine
learning models using breast cancer datasets like the WDBC and BCCD. The following
review provides a comprehensive appraisal of these investigations, focusing
on their methodology, model performance, and potential limitations.

This literature review adopts a systematic approach. Preliminary efforts involved establishing
search terms such as 'machine learning', 'breast cancer diagnosis', 'accuracy', 'prediction',
'datasets', and 'performance'. These terms acted as critical markers during the exploration of
the vast body of existing research. The primary databases consulted for this review were LUT
Primo, PubMed, Science Direct, IEEE Xplore, and Google Scholar. Each database was thoroughly
examined using the predefined terms, emphasizing studies published within the last decade to
ensure insights were current and relevant. After identifying potential studies of interest, each
was scrutinized for methodological consistency, data robustness, alignment with the research
question, and its contribution to the field. A structured template facilitated the extraction of
information from these studies, capturing key details such as authors, publication year,
methodology, machine learning models used, datasets employed, and reported accuracy. Each
study's impact on the knowledge base, its methodological strengths and limitations, and the
implications of its findings were critically examined. This review also encompassed
studies that signaled areas of improvement in existing models or methodologies, not just those
reporting high accuracy.

3.1. Previous research on WDBC and BCCD datasets

Various machine learning algorithms have been used for BC diagnosis, with differing results
in terms of accuracy, precision, recall, and F1-scores. The performance of these algorithms in
different demographic settings and with diverse datasets has not been fully explored. Given the
significance of effective BC detection and diagnosis, it is crucial to systematically review and
compare the performance of these algorithms.

A study conducted by Al-Azzam and Shatnawi (2021) compared the effectiveness of
supervised learning (SL) and semi-supervised learning (SSL) across nine different
classification algorithms, utilizing the WDBC dataset. The researchers evaluated their
models by randomly selecting 20% of the breast cancer data as a test sample. LR and KNN
yielded the best results in both SL and SSL, with accuracy rates ranging from 97% to 98%.
These two algorithms demonstrated a high prediction rate for both malignant
and benign tumors, with LR accurately predicting 100% of the malignant category for both SL
and SSL. The precision, recall, and F1-scores of LR and KNN underscored the absence of
overfitting or underfitting in these algorithms. Interestingly, the study also revealed that SSL
approaches performed comparably to, and in some cases better than, their SL counterparts,
suggesting that SSL techniques can be an effective and reliable alternative for BC diagnosis,
particularly in settings with fewer labelled data. This research is noteworthy for its comparative
approach, which contributes valuable insights into the relative performance of SL and SSL.
The strong performance of the KNN and LR algorithms in Al-Azzam and Shatnawi's
(2021) study informs our decision to include these algorithms. Despite the insights gleaned,
this study is not without its limitations. The specificity of the WDBC dataset, while beneficial
for focused research, could potentially limit the generalizability of the results. As such, in our
research, we aim to utilize additional datasets, such as the BCCD dataset, to ensure a diverse
demographic representation, thereby broadening the applicability of our model.

In the research conducted by Austria (2019), an array of machine learning models was
evaluated for efficacy in predicting BC using the BCCD dataset. The research evaluated seven
different machine learning algorithms, specifically KNN, LR (employing both L1 and L2
regularization), Linear SVM (utilizing L1 and L2 regularization), Nonlinear SVM, DT, Random
Forest, Gradient Boosting, and Gaussian Naive Bayes.

A key finding from the study was the comparatively low accuracy rates demonstrated by these
models. Among them, the Gradient Boosting classifier performed the best, albeit with an
accuracy of just 74.14%. At the lower end of the spectrum, KNN showed the lowest accuracy,
at 58.14%. The researchers attributed these relatively subdued accuracy rates to
improper data standardization, which seemed to have a notable impact despite
the thorough hyperparameter tuning applied across all models. Ultimately, the Austria (2019)
study serves as a significant benchmark for future research in this domain. The results
emphasize the importance of proper data standardization when utilizing machine
learning models, particularly in the context of the BCCD. The study thus illuminates a clear
path forward for future endeavors aiming to enhance the accuracy of BC prediction using
machine learning techniques.

Sharma et al. (2018) conducted a study that incorporated the k-fold cross-validation method,
specifically a 10-fold technique, to partition the data into ten distinct segments. From a total of
569 observations, 398 were utilized for the training set and the remaining 171 served as the
testing set. This division established a training to testing ratio of approximately 70:30. The
central aim of this study was to carry out a comparative exploration of three unique ML
algorithms. These were implemented on the WDBC dataset, offering valuable insights into
their utility for diagnostic purposes. Significantly, Sharma et al. (2018) reported that all the
algorithms under study achieved an accuracy of over 94% in differentiating benign from
malignant tumours. Notably, the KNN algorithm displayed superior performance in
comparison to its counterparts. It excelled in terms of accuracy, precision, and F1 score, metrics
essential for evaluating the performance of diagnostic models. However, the study observed
that the maximum accuracy achieved by any of the algorithms did not exceed 96%. This
limitation could be attributed to the use of base machine learning models without the
optimization offered by hyperparameter tuning. In comparison, Al-Azzam and Shatnawi
(2021) reported higher accuracy rates ranging from 97% to 98% for their machine learning
models. Their study highlighted the potential utility and efficiency of optimized machine
learning models in cancer diagnostics.

While LR and KNN demonstrated impressive results in Al-Azzam and Shatnawi's (2021)
study, other machine learning models, such as SVM, have also been extensively utilized in BC
detection. The SVM model, known for its robustness and versatility, has been explored in
different configurations and methodologies. One such methodology was proposed by Osman
(2017) using the WBC dataset. The technique consisted of applying an SVM model with 10 folds
followed by a two-step clustering method: in the first step, data instances in the dataset are
assigned to pre-clusters, and in the second step, a clustering algorithm is applied to these
pre-clusters. The disadvantage of this strategy is that the clusters sometimes do not represent
meaningful groupings, which might lead to bias. The research reported an accuracy of 99.10%,
which is better than the 96.69% of the traditional SVM. The study further confirmed the
effectiveness of the method by conducting a t-test, which emphasized the significant improvement
brought by the two-step SVM approach.

In the research conducted by de Brito (2018), SVM models consistently surpassed the
performance of the baseline model in all instances. This was particularly apparent when 'noise'
variables, namely adipokines related to obesity, were included in the models. The SVM models
with noise delivered better results in terms of accuracy than those without, with one notable
exception being the linear kernel model without noise, which demonstrated a superior sensitivity
level but at the cost of specificity. The SVM models showed a variety of results across different
kernel types: linear, polynomial, and radial. In the case of the linear model, the accuracy
improved from 62.86% without noise to 71.43% with noise. A similar improvement was observed in
the polynomial model. The radial model, however, showed the highest accuracy in both settings,
scoring 68.57% without noise and 77.14% with noise. All models performed better than chance,
indicating the value of these variables; however, the study concluded that the performance
levels did not surpass other methods discussed in the existing literature. Overall, the accuracy
of all models was below 80%.

In the study by Durgalakshmi and Vijayakumar (2015), a new approach to automated BC
diagnosis was proposed and developed. The authors created an SVM designed to increase the
performance of the DT algorithm. The DT algorithm achieved an accuracy of 82.78% when
using all features, improving to 86.34% when only selected features were considered. This
accuracy exceeded that of the other classifiers considered, such as Naïve Bayes, KNN,
and random forest. The research method involved the selection of five specific features to
predict whether a patient has a benign or malignant tumour. These five features were derived
from an original set of 32 through the correlation matrix method. Classification algorithms
were then tested using both the complete set and the selected set of features. The methodology
was further supplemented by the use of the J48 DT model with leaf nodes classified as either
malignant or benign. Pre-processed data was loaded into the Weka software and Information Gain
was applied to the dataset attributes. The J48 model was then implemented for DT generation,
with each leaf node acting as the class label. Diagnosis for new patients was determined by
cross-referencing values in the DT, thereby specifying the type of tumour as benign or malignant.

The Faud (2018) study examined seven different machine learning algorithms for detecting BC
using a specified dataset. The algorithms implemented included Random Forest, SVM, KNN, LR,
Gaussian Naïve Bayes, Convolutional Neural Network (CNN), and Artificial Neural Network
(ANN). Without Principal Component Analysis (PCA), the deep learning models (CNN and ANN)
were found to outperform the other algorithms, with ANN achieving a perfect recall score of
100%, a critical metric in cancer detection, where false negatives can have serious
implications. Nevertheless, the precision score of the ANN model was somewhat lower,
indicating that it might produce more false positives. While CNN and ANN achieved superior
results with the given dataset, the performance drop seen with the application of PCA signals
the potential sensitivity of these models to data transformations. Additionally, the variability
in performance across different models emphasizes the importance of thoroughly testing
multiple algorithms in machine learning applications for BC detection.

Table 2. Performance of Machine Learning Models on Breast Cancer Datasets Utilized in Our Study

Author | Methodology | ML model | Dataset | Accuracy
(Al-Azzam and Shatnawi, 2021) | PCA and t-SNE | Supervised & unsupervised algorithms | WDBC | 97% to 98%
(Hazra et al., 2020) | Feature correlation | DT, ANN | WDBC | DT: 96%; ANN: 98%
(Austria et al., 2019) | 70:30 split with all possible hyperparameter | 7 ML models | BCCD | Max accuracy: Gradient boosting with 74.14%
(Sharma et al., 2018) | 70:30 split ratio with cross-validation | RF, Naïve Bayes, KNN | WDBC | RF: 94.74%; KNN: 95.90%; Naïve Bayes: 94.47%
(Brito, 2018) | Models applied with noise in dataset | DT, LR, SVM | BCCD | DT: 57.14%; LR: 62.86%; SVM: 71.14%
(Faud, 2018) | With PCA and basic SVM model | 7 different | WDBC | Max: KNN, 97.12%
(Osman, 2017) | Two step clustering | SVM | WBC | 99.10%
(Durgalakshmi and Vijayakumar, 2015) | J48 DT | SVM, KNN, RF, DT | WDBC | SVM: 73%; DT: 86.73%
(Krishnan et al., 2010) | Different SVM kernel with 10 features | SVM | WDBC | 93.75%

3.2. Investigating Machine Learning Approaches for Breast Cancer Diagnosis

The literature review reveals a wide spectrum of machine learning methods applied to BC
detection, spanning from SVM and KNN to LR, Naive Bayes (NB), and Random Forests (RF).
However, the recent trend in the field leans towards deep learning techniques, especially
Convolutional Neural Networks (CNNs), given their superior performance. In a notable study,
Vaka et al. (2020) leveraged deep neural networks to develop a BC diagnostic algorithm,
achieving an impressive precision of 97.21%. Further enriching the landscape of deep learning
applications in BC diagnosis, Vaka et al. (2019) introduced a novel approach using a mix of
machine learning and deep learning techniques, including RCNN and Bidirectional Recurrent
Neural Networks (HA-BiRNN). Their simulation results underscored the enhanced precision,
efficiency, and image quality obtained using the DNN method. In another compelling study,
Vasundhara et al. (2019) proposed an intuitive method for classifying mammography images
as normal, benign, or malignant, utilizing several machine learning techniques. However, it
was their CNN and ANN models that outperformed the traditional machine learning
approaches, achieving accuracies of 97.3% and 99.3%, respectively. This suggests that CNN,
with its sophisticated filtering and morphological operations, is a highly effective classifier for
intuitive classification of digital mammograms.

Table 3. Overview of Breast Cancer Studies Utilizing Machine Learning Techniques

Author | Title | Method | Findings | Results | Dataset used
Dong et al., 2022 | Phenotype Discovery and Geographic Regional Differences in the Prognosis of Advanced BC in the U.S: A ML Approach | Random forest and regression tree | Seven distinct phenotypes of advanced BC were identified. | When combined with geographical approaches, machine learning approaches such as CART and random forest present a feasible alternative for future disparities research. | Tabular
Reza Rabiei et al., 2022 | Prediction of BC using Machine Learning Approaches | Random Forest, Gradient Boosting Trees, Multi-Layer Perceptron method | Models' efficacy could be bolstered by the mammographic characteristics and other features. | Anticipate BC, since early discovery of this disease can slow the progression of the disease and minimise the death rate by allowing for proper therapeutic interventions at the correct time. | Tabular
Guha et al., 2022 | Mortality, risk factors and Incidence of atrial fibrillation for BC: an SEER-Medicare analysis | SEER-Medicare analysis | The largest risk of AF occurs within the initial 60 days following a cancer diagnosis, with an annual incidence of 3.9%. The risk of dying from heart disease and other causes is increased when AF develops after breast cancer treatment. | After a diagnosis of BC, the prevalence of AF in women increases dramatically. BC diagnosed at a more advanced stage is substantially correlated with AF. Women with such a new BC diagnosis that develop AF have an increased chance of death from cardiovascular causes, but not from the cancer itself. | Tabular
Chamieh et al., 2022 | Evaluation of fine-needle cytology for BC: a statistical analysis | Greenes method and Begg | The FNAC test's specificity was consistently higher than that of the other proposed techniques (94%). | Regardless of what method was proposed, the specificity of the FNAC test was always greater than its sensitivity. The odds ratios were all on the positive side for every strategy. A high proportion of results might be either positive or negative, confirming the test's high level of discrimination. | Tabular
Thangarajan et al., 2022 | Identification and validation of plasma biomarkers for diagnosis of BC in South Asian women | BC gene expression profiling | BC detection sensitivity was 65% and specificity was 80% when six marker proteins (basic fibroblast growth factor, leptin, adipsin, syndecan-1, dickkopf-3, and interleukin 17B) were used together. | There is a 100% detection rate and a 91% specificity for differentiating BC from benign and control conditions based on the levels of methylation in WIF1, DACT2, and SOSTDC1. | Tabular
Chakravarthy et al., 2022 | Multi-deep CNN based experimentations for early diagnosis of BC | Multi-deep CNN | PCA is used in this experiment to cut down on processing time and the size of feature vectors produced by fusion. | By applying fusion to the deep features of both datasets, the highest classification accuracy was achieved, surpassing all existing advanced frameworks. Although applying PCA to the combined deep features did not enhance classification performance, it did expedite the execution time, resulting in lower processing expenses. | Image
Wang et al., 2022 | Employing Convolutional Neural Networks Trained on Predict Axillary Lymph Node Metastasis to Multiparametric MRI on BC Before Surgery | CNN | A battery of statistical tests was used to determine statistical significance; P-values below 0.05 were deemed to hold statistical importance. | With an accuracy and AUC of 0.933/0.989, the T2WI sequence fared better than the other three sequences in the validation set, a significant improvement compared to T1WI's accuracy of 0.691 and AUC of 0.806. | Image
Melekoodappattu et al., 2022 | Methods for detecting BC in mammograms using a hybrid of modified CNN and textural features. The Journal for Ambient Intelligence with Humanized Computing | CNN and texture feature-based approach | Using the ensemble approach, MIAS had a specificity of 97.8% and an accuracy of 98.6%, while DDSM scored 98.3% and 97.9%. | According to experimental evidence, the combined technique improves measurement metrics at each stage in a manner that is independent of the others. | Image
Gonçalves et al., 2022 | Optimization of a convolutional neural network (CNN) architecture for infrared BC detection using bio-inspired techniques | CNNs | A genetic algorithm as well as a particle swarm optimization are used to improve all fully connected layers in three state-of-the-art convolutional neural networks (CNNs): VGG-16, ResNet-50, and DenseNet-201. | All three networks achieved F1-scores higher than 0.90, with VGG-16 jumping from 0.66 to 0.92. In addition, ResNet-50's F1-score was raised from 0.83 to 0.90, which is an improvement over the previous study. | Image
(Atrey et al., 2022) | Analysis of Breast Cancer Using Machine Learning Methods and computer | ML technique | Accurate results may aid in determining whether a malignancy will be malignant or benign. | An SVM is used for the analysis, and several quality metrics, including precision, F1-score, and accuracy, are calculated. | Tabular
Naji, 2021 | Machine Learning Algorithms For BC Prediction And Diagnosis | Random forests or random decision forests | Determine the best in terms of confusion matrix accuracy and precision. | SVM was shown to be the most effective classifier, with a success rate of 97.2%. | Tabular
Bhise et al., 2021 | BC Detection using ML Techniques | CNN method | CNN was found to be superior to other approaches in terms of accuracy, precision, and data set size. | Accuracy and precision are used as yardsticks for the system's efficiency. The probabilistic outcomes have been predicted with the use of activation functions like ReLU. | Image
Shen et al., 2019 | Using Deep Learning to Screening Mammography to Boost BC Detection | CNN and convolutional network method | More datasets without ROI annotations can be used to fine-tune the method, even if the datasets were produced from mammography platforms with varying pixel intensity distributions. | These results demonstrate the feasibility of training automatic deep learning algorithms to find accuracy on mammography platforms. They show promise for enhancing and decreasing false screening. | Image
Naif Abdullah et al., 2016 | Proteomic Investigation of Stage-II BC using Formalin-Fixed Paraffin-Embedded Tissues | 2D Gel Analysis, MALDI-TOF Analysis | Proteins engaged in cellular pathways associated with tumour and cancer formation, including cell cycle and angiogenesis. | Proteins in a malignant state could be crucial stages in creating a database of BC diagnostic markers, including a proteome database. | Image

4. Data and methodology

This research utilizes the WDBC and BCCD datasets, sourced from the UCI Machine
Learning Repository. Following the exploratory analysis, each dataset was partitioned into
training and testing sets. Four distinct predictive methods were implemented: KNN, LR, DT, and
SVM. These classifiers were applied to the datasets, and confusion matrices and other metrics
were used to assess model efficacy. The final step involved a
comparative study of the accuracy of each model against established ones, aiming to identify
the most effective strategy.
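The partitioning step can be sketched as a simple shuffled hold-out split. The 30% test fraction, the fixed seed, the placeholder rows, and the function name `train_test_split` are assumptions made for this illustration and are not taken from the study's configuration; only the WDBC sample count of 569 comes from the dataset description.

```python
import random


def train_test_split(rows, labels, test_fraction=0.3, seed=0):
    """Shuffle indices reproducibly and hold out a fraction of rows for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = ([rows[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([rows[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test


# Placeholder rows with the WDBC sample count; real feature vectors would go here.
rows = [[float(i)] for i in range(569)]
labels = [i % 2 for i in range(569)]

(train_X, train_y), (test_X, test_y) = train_test_split(rows, labels)
print(len(train_X), len(test_X))
```

Fixing the seed keeps the partition reproducible, so that differences between classifiers are not caused by different train and test samples.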

4.1. Data

The WDBC and BCCD datasets were chosen for this research due to their broad applicability
in numerous research areas. ML models were trained on these binary-labelled datasets, achieving
an acceptable level of accuracy. The subsequent subsections provide a detailed rationale for the
specific selection of these datasets:

4.1.1. Coimbra Breast Cancer Dataset

The BCCD dataset, a smaller and less-analyzed patient dataset than WDBC (Patrício et al., 2018),
presents an excellent opportunity to examine how ML models perform on limited data. Exploring this
dataset gives valuable insight into the best ML model for BC diagnosis, contributing to the
ongoing efforts to improve cancer detection and treatment.

It contains clinical data collected from patients at the University of Coimbra in Portugal. The
dataset is focused on BC diagnosis using biomarkers and clinical features, making it suitable
for ML analysis. It consists of several columns, each representing a different clinical feature.
The columns and their data types are as follows:

Table 4. Overview of patient health features in the BCCD dataset

Feature no.   Feature Name   Data Type
1             Age            Integer
2             BMI            Float
3             Glucose        Integer
4             Insulin        Float
5             HOMA           Float
6             Leptin         Float
7             Adiponectin    Float
8             Resistin       Float
9             MCP-1          Float

This dataset comprises nine features, each representing a unique aspect of the patient's health.
Some of these features emphasize body composition, while others concentrate on hormone
levels. The dataset includes both integer and floating-point data types for these features.

4.1.2. Wisconsin Breast Cancer Dataset

The WDBC dataset, which includes data from 569 patients, was used to create the breast tumor
feature set. This dataset, disseminated by Dr. William H. Wolberg from the University of
Wisconsin-Madison's Department of General Surgery, comprises fluid samples from patients
with solid breast tumors. The cytological feature analysis during digital scans was facilitated
by a program called Xcyt (Wolberg & Mangasarian, 1990). This program employs a curve-
fitting method to calculate ten features and returns the mean, worst case, and standard error
(SE) for each. Each sample's record also includes a diagnosis—either malignant (M) or benign
(B). The dataset comprises 569 instances and 32 attributes, including ID, diagnosis, and 30
input features. The classes in WDBC are moderately imbalanced, with approximately 37% malignant
and 63% benign cases.
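The dataset dimensions and class proportions quoted above can be checked with a few lines of Python. The sketch below is only illustrative and assumes the copy of WDBC bundled with scikit-learn rather than the UCI download used in this work; the bundled copy omits the ID column and encodes malignant as 0 and benign as 1.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled WDBC copy stands in for the UCI file.
data = load_breast_cancer()
X, y = data.data, data.target          # 0 = malignant, 1 = benign in this copy

n_samples, n_features = X.shape
counts = np.bincount(y)                # [malignant, benign]
malignant_share = counts[0] / n_samples
benign_share = counts[1] / n_samples

print(n_samples, n_features)
print(f"malignant: {malignant_share:.1%}, benign: {benign_share:.1%}")
```

The 30 columns correspond to the mean, standard error, and worst value of the ten computed features.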

4.2. Experiment setup

The experiments were written in Python and run in a Jupyter notebook, a common environment for
data analysis and ML tasks. The analysis proceeded through the following stages for BC prediction:

1) Import the relevant Python libraries (NumPy, pandas, and sklearn) and perform the
pre-processing steps, including removal of missing values.

2) Run the four prediction methods (KNN, LR, DT, and SVM) on each dataset.

3) Assess the effectiveness of the models by creating and utilizing appropriate functions,
including the ROC curve, confusion matrix, cross-validation metrics, learning curve, and the
precision-recall curve.
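A minimal sketch of what such an evaluation function might look like is given below. The helper name `evaluate`, the choice of LR as the example model, and the use of scikit-learn's bundled WDBC copy are assumptions made for illustration; only the confusion matrix, AUC-ROC, and cross-validated accuracy are covered here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def evaluate(model, X_train, X_test, y_train, y_test, cv=5):
    """Hypothetical helper: fit the model and collect a few evaluation outputs."""
    cv_scores = cross_val_score(model, X_train, y_train, cv=cv)   # accuracy per fold
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]                     # P(class 1)
    return {
        "confusion_matrix": confusion_matrix(y_test, model.predict(X_test)),
        "roc_auc": roc_auc_score(y_test, proba),
        "cv_mean_accuracy": cv_scores.mean(),
    }


X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
report = evaluate(
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    X_train, X_test, y_train, y_test,
)
print(report)
```

Wrapping the scaler and the classifier in one pipeline means the scaler is refitted inside every cross-validation fold, so no validation data influences the scaling.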

4.3. Data Preparation

Prior to model implementation, both the BCCD and WDBC datasets were preprocessed. Each dataset
was partitioned into training and test sets following an 80:20 ratio. The data were then
standardized so that all features had the same scale, using the StandardScaler() class from the
Scikit-Learn library. To avoid data leakage, the StandardScaler() was fitted on the training data
only and then used to transform both the training and test sets. The 80:20 ratio was chosen to
keep both classes adequately represented in each partition, since a skewed partition could
introduce bias into our model. Furthermore, given the relatively small size of the BCCD dataset,
which contains only 116 instances, it is of utmost importance to provide sufficient data to train
the model effectively. Ensuring that the model has ample training data contributes to improved
accuracy and predictive performance. Therefore, this partition ratio was deemed optimal for our
particular research scenario.
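The leakage-free ordering of these two steps can be sketched as follows. The use of `stratify`, the `random_state` value, and scikit-learn's bundled WDBC copy are assumptions for illustration and are not taken from the original notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# 80:20 split; stratify keeps the class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)   # mean and std learned from training data only
X_test_std = scaler.transform(X_test)         # same statistics reused, no refitting

print(X_train_std.shape, X_test_std.shape)
print(np.abs(X_train_std.mean(axis=0)).max(), np.abs(X_test_std.mean(axis=0)).max())
```

The training features end up with zero mean by construction, while the test features do not, which is expected: the test set is scaled with statistics it did not contribute to.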

4.4. Methodology

The WDBC and BCCD datasets were collected from the UCI ML repository. Pre-processing was
conducted on each dataset separately, so that malignant WDBC cases, benign WDBC cases, and BCCD
cases were each assigned to their own group. Based on the correlation coefficients, the features
were classified as favourable, unfavourable, or random. Recognizing and eliminating irrelevant
features is essential for achieving good results. After the exploratory analysis, each dataset
was split into a training set and a test set. The datasets were analyzed using four prediction
methods: KNN, LR, DT, and SVM. After training, each classifier made its predictions on the test
set, and confusion matrices and other metrics were used to evaluate the effectiveness of the
models. Finally, the accuracy of each model was compared with that of well-known models from the
literature. The overall analysis process is depicted in Figure 1.

[Flowchart: Datasets collection -> Data pre-processing -> Exploratory data analysis (feature
distribution, PCA) -> Hyperparameter optimization and feature elimination -> Logistic regression,
SVM (linear and polynomial), KNN, and decision tree (hyper-parameter optimization and classifier
prediction) -> Performance evaluation -> Result analysis and accuracy comparison]

Figure 1. Methodology Process: PCA, Hyperparameter Tuning, and Outcomes Visualization

4.4.1. Hyperparameter Optimization

GridSearch Cross-Validation (CV) is a technique used in ML to select the optimal hyperparameters
for a specific model. It establishes a parameter grid that is methodically examined via K-fold
cross-validation: every combination of parameter values in the grid is cross-validated to
ascertain the combination that yields the best model performance. GridSearch K-fold CV can be
employed to optimize hyperparameters for prediction models such as LR, SVM, and KNN.

The first step in applying GridSearch K-fold CV to the WDBC dataset is to define the
hyperparameters and their candidate values for the selected model. For instance, if an SVM is
chosen, it is essential to explore various values for the cost parameter (C) and different kernel
types. Subsequently, the dataset is partitioned into K folds of roughly equal size, with care
taken to maintain an approximate balance of malignant and benign instances within each fold.

For KNN, the parameter 'k' is the number of neighboring points considered when making a
prediction, and adjusting it can significantly affect the model's performance. If 'k' is too
small, the model may be overly sensitive to noise in the data; if 'k' is too large, the model may
be oversimplified and fail to capture important patterns. Therefore, a range of 'k' values was
explored to identify the number of neighbors that yields the highest prediction accuracy. Three
distance metrics were evaluated: Euclidean, Manhattan, and Chebyshev.

For the LR model, the hyperparameters tuned were 'C', 'penalty', and 'solver'. Different values
of 'C', the inverse of the regularization strength, were tried during tuning to strike the right
balance between underfitting and overfitting. Different solvers perform well with different types
of data, and their choice can significantly affect the efficiency and accuracy of the model. The
liblinear solver is an apt choice for small-scale datasets and binary classification tasks.
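As an illustration of the search described above, the following sketch tunes KNN over 'k' and the three distance metrics with scikit-learn's `GridSearchCV`. The candidate ranges, the 5-fold setting, and the random seeds are hypothetical, since the exact values tried are not listed here; scikit-learn's bundled WDBC copy is used as the data source.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Hypothetical search ranges, chosen only for illustration.
param_grid = {
    "knn__n_neighbors": list(range(1, 16)),
    "knn__metric": ["euclidean", "manhattan", "chebyshev"],
}

# Stratified folds keep the malignant/benign balance similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_, round(search.best_score_, 4))
print("test accuracy:", round(search.score(X_test, y_test), 4))
```

An analogous grid over 'C', 'penalty', and 'solver' would apply to LR, and one over 'C' and the kernel type to SVM.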

For each combination of hyperparameters, the prediction model is trained on K-1 of the folds and
its performance is evaluated on the remaining validation fold, rotating until every fold has
served once as the validation fold. Suitable metrics, such as accuracy or F1 score, should be
employed for this evaluation. Afterward, the average performance across all K folds is calculated
for each hyperparameter combination. This averaging gives a more reliable estimate of the model's
performance with that specific combination of hyperparameters than a single split would.
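The fold rotation and averaging can also be written out by hand, which makes the procedure explicit. The sketch below does this for a linear SVM; the candidate values of C, the number of folds, and the use of scikit-learn's bundled WDBC copy are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Hypothetical candidate values for the SVM cost parameter C.
mean_accuracy = {}
for C in (0.1, 1, 10):
    fold_scores = []
    for train_idx, val_idx in folds.split(X, y):
        model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=C))
        model.fit(X[train_idx], y[train_idx])          # train on K-1 folds
        pred = model.predict(X[val_idx])               # validate on the held-out fold
        fold_scores.append(accuracy_score(y[val_idx], pred))
    mean_accuracy[C] = float(np.mean(fold_scores))     # average over the K folds

best_C = max(mean_accuracy, key=mean_accuracy.get)
print(mean_accuracy, best_C)
```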

Once the set of hyperparameters that leads to the best average result across all K folds is
identified, it is chosen as the final configuration for the model. Lastly, the prediction model is
retrained using the entire WDBC dataset with the chosen hyperparameters, creating the final,
optimized model. Applying GridSearch K-fold CV to the dataset in this way optimizes the
hyperparameters of the prediction models and contributes to the development of more accurate and
reliable tools for distinguishing malignant and benign breast tumours.

4.4.2. Visualizing Feature Importance: Insight into Significant Factors for Breast
Cancer Diagnosis

The RandomForestClassifier, an ensemble ML model, was chosen to assess the significance of each
feature in the dataset for predicting BC. The RandomForestClassifier, available in the
scikit-learn library, builds multiple DTs during training and provides an aggregate prediction
from these individual trees. This aggregation typically makes the model more accurate and reliable
than a single DT. The algorithm also assigns a quantitative importance to each input feature by
averaging the impurity decrease attributable to that feature over all trees in the forest. This
allows it to rank features, identifying those that contribute most to accurate BC diagnosis.

For this study, the dataset was initially loaded and prepared using the Pandas library, and the
target variable was separated from the input features. Subsequently, a random forest classifier
was instantiated with a fixed random state to guarantee the reproducibility of the results. The
classifier was then trained with the dataset, and feature importances were calculated after
training, based on the impurity decrease within each DT.

To provide an intuitive representation, a bar chart visualizing feature importances was
generated. This bar chart displays the mean importance of each feature, along with error bars
that denote the standard deviation of importance scores across the forest. The visualization aids
in identifying critical features in the BCCD dataset for BC diagnosis. By understanding the
relative importance of each feature, more precise predictive models can be developed, focusing
on the most crucial factors contributing to BC.
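The quantities behind such a chart can be computed as sketched below. Because the BCCD file is not bundled with scikit-learn, the sketch assumes the bundled WDBC copy as a stand-in, so the ranking it prints is not the BCCD ranking; the number of trees and the random seed are likewise illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Assumption: WDBC from scikit-learn stands in for the BCCD CSV, which is not bundled.
data = load_breast_cancer()
X, y, names = data.data, data.target, data.feature_names

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

importances = forest.feature_importances_          # mean impurity decrease, normalised
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)

order = np.argsort(importances)[::-1]              # most important first
for i in order[:5]:
    print(f"{names[i]:<25s} {importances[i]:.3f} +/- {std[i]:.3f}")
least_important = names[order[-1]]
print("least important:", least_important)
```

The arrays `importances` and `std` can be passed to a bar chart as bar heights and error bars respectively to reproduce the style of plot described above.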

For the ensuing analysis, the derived feature importance values serve as the foundation for
feature selection. The least significant feature, MCP-1, is eliminated, and the impact on the
accuracy of four ML algorithms (KNN, LR, DT, and SVM) is assessed. This approach reduces the
complexity of our model. Simplifying the model by eliminating less significant features can
improve its interpretability without substantially compromising its predictive accuracy. Moreover,
a feature with minimal importance, such as MCP-1, can introduce noise into the model and reduce
its overall predictive capability. By eliminating MCP-1, we aim to reduce this noise, thereby
improving the precision and robustness of the model's predictions. This analysis helps clarify how
feature selection affects the effectiveness of various classification algorithms, particularly
for diagnosing BC.
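A minimal sketch of this elimination experiment is shown below. It assumes scikit-learn's bundled WDBC copy as stand-in data and default model settings, so it demonstrates the procedure only and does not reproduce the BCCD results.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # stand-in data, not BCCD
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Rank features on the training split only, then drop the weakest one.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
drop = int(np.argmin(rf.feature_importances_))
keep = [j for j in range(X.shape[1]) if j != drop]

models = {
    "KNN": KNeighborsClassifier(),
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
}
results = {}
for name, clf in models.items():
    full_model = make_pipeline(StandardScaler(), clone(clf)).fit(X_tr, y_tr)
    reduced_model = make_pipeline(StandardScaler(), clone(clf)).fit(X_tr[:, keep], y_tr)
    results[name] = (
        full_model.score(X_te, y_te),
        reduced_model.score(X_te[:, keep], y_te),
    )
    print(f"{name}: all features {results[name][0]:.4f}, reduced {results[name][1]:.4f}")
```

Ranking the features on the training split alone keeps the test set untouched until the final comparison.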

Figure 2. Feature importance of BCCD dataset

4.5. Justification of Machine Learning Models Used for this Evaluation

The models chosen for the current research on BC diagnosis, namely DT, KNN, LR, and SVM, were
selected based on their reported success in similar contexts in the literature and on their
theoretical underpinnings. Each model offers distinct advantages for the type of dataset and
problem domain at hand.

DT and KNN, as shown in Al-Azzam and Shatnawi's (2021) study, achieved remarkable accuracy on the
WDBC dataset. Both methods detect patterns in a multidimensional feature space, which is essential
for medical datasets that often exhibit complex structures and interactions among features. DT
models further provide a transparent decision-making process, aiding clinicians in understanding
the model's predictions.

SVM, as used by Osman (2017), accommodates both linear and non-linear relationships through the
kernel trick. The choice of kernel and the regularization parameter 'C' allow us to manage the
bias-variance trade-off, which is critical for preventing overfitting and underfitting. Hence, SVM
is well suited to a dataset such as ours, where this flexibility can be used to model complex data
patterns effectively.

LR, another model we have adopted, is fundamentally a binary classification algorithm.

The probability outputs provided by the LR model are highly interpretable, providing
meaningful predictions of a tumor being malignant or benign. LR's ability to estimate feature
coefficients also provides us with insights into each feature's influence on the prediction,
enabling the possibility of feature importance analysis. LR also includes a regularization
component that helps to avoid overfitting by penalizing large values of the parameters. The
selected libraries provide an effective execution and optimization of these models, offering an
extensive range of tools for tasks such as pre-processing of features, model training, validation,
and the evaluation of performance.
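The two properties of LR discussed above, probability outputs and interpretable coefficients, can be illustrated with a short sketch. It assumes scikit-learn's bundled WDBC copy (malignant encoded as 0, benign as 1) and an arbitrary value of C; it is not the tuned model reported in the results.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)   # standardized so coefficients are comparable
y = data.target                                 # 0 = malignant, 1 = benign in this copy

# C is the inverse regularization strength: smaller C means a stronger penalty.
lr = LogisticRegression(C=0.1, max_iter=1000).fit(X, y)

proba = lr.predict_proba(X[:3])                 # columns: P(malignant), P(benign)
coef = lr.coef_.ravel()
top = np.argsort(np.abs(coef))[::-1][:5]        # five largest coefficients by magnitude
for j in top:
    print(f"{data.feature_names[j]:<25s} {coef[j]:+.3f}")
print(proba.round(3))
```

With standardized inputs, the sign of a coefficient shows the direction in which a feature moves the predicted class, and its magnitude gives a rough indication of influence.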

5. Result

5.1. Performance on ML methods

The following section examines the performance of the four selected ML algorithms. To ensure a
fair evaluation, each algorithm is assessed against the same metrics: accuracy, precision,
sensitivity (recall), F1-score, and AUC-ROC.

5.1.1. K-nearest Neighbor

This section examines the performance of the KNN model on the BCCD dataset, focusing on three
distance measures: Euclidean, Manhattan, and Chebyshev. The hyperparameter 'k', the number of
neighbors the model considers when making a prediction, had to be set. For each distance measure,
various 'k' values were tested to find the configuration that maximizes model accuracy on this
dataset. As indicated in Table 5, the maximum accuracy, 87.50%, was obtained by the Euclidean
distance measure with k=1 and k=4. Although both configurations achieved the same accuracy, 4
neighbors is recommended over 1: with k=1 each prediction depends on a single nearest neighbor,
which makes the model sensitive to noise in new data and prone to overfitting. For the Manhattan
distance measure, the model's accuracy was relatively consistent across k values, ranging between
70.83% and 79.16%, with the maximum reached at k=1, 4, and 6. For the Chebyshev distance measure,
the model's performance was also fairly consistent, with accuracies ranging from 70.83% to 83.33%.
The peak accuracy was achieved when k was 4, 5, or 6.

Table 5. Comparison of KNN Accuracy (%) on BCCD Dataset Using Different Distance Measures

Distance     K=1     K=2     K=3     K=4     K=5     K=6
Euclidean    87.50   70.83   79.16   87.50   83.33   79.16
Manhattan    79.16   75      70.83   79.16   75      79.16
Chebyshev    75      70.83   79.16   83.33   83.33   83.33
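A table of this shape can be produced with a double loop over distance measures and 'k' values, as in the sketch below. Since the BCCD file is not bundled with scikit-learn, the bundled WDBC copy is assumed as stand-in data, so the printed accuracies will differ from those in Table 5; the split seed is also illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: WDBC stands in for BCCD, so the numbers will differ from Table 5.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

metrics = ("euclidean", "manhattan", "chebyshev")
accuracy = {}
for metric in metrics:
    for k in range(1, 7):
        knn = KNeighborsClassifier(n_neighbors=k, metric=metric).fit(X_tr, y_tr)
        accuracy[(metric, k)] = knn.score(X_te, y_te)

for metric in metrics:
    row = "  ".join(f"{100 * accuracy[(metric, k)]:6.2f}" for k in range(1, 7))
    print(f"{metric:<10s} {row}")
```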

Figure 2. Precision, Recall and F1 score with KNN algorithm on BCCD

The KNN model's performance was also evaluated using precision, recall, and F1-score, as
illustrated in Figure 2. With the Euclidean distance measure, the model achieved a precision of
90.00%, a recall of 87.50%, and an F1-score of 87.30%. With the Manhattan distance measure,
performance declined, with precision, recall, and F1-score values of 85.29%, 79.16%, and 78.22%,
respectively. The Chebyshev distance measure fell between the two, with precision, recall, and
F1-score values of 87.50%, 83.33%, and 82.86%, respectively. Given these results, the Euclidean
distance measure gave the highest accuracy and precision overall.

The least important feature of the BCCD dataset was identified as MCP-1. Upon the removal
of this feature from the dataset, a significant improvement in the KNN algorithm's performance
was observed. The accuracy of the KNN model, configured with 5 neighbors and the Manhattan
distance metric, reached 95.83%.

Turning to the WDBC dataset, the basic KNN model achieved an accuracy of 95.61%. Hyperparameter
tuning raised the prediction accuracy to 96%, which corroborates the impact of hyperparameter
selection on the model's predictive performance. The final approach combined PCA with
hyperparameter tuning and produced the largest improvement, with an accuracy of 96.49%. The
optimal value of 'k' was determined to be 9. The most effective distance metric differed between
the two datasets: Euclidean for the BCCD dataset and Manhattan for the WDBC dataset. The confusion
matrix shows that the model correctly predicted 106 out of 108 benign cases and 59 out of 63
malignant cases. Figures 3 and 4 show the confusion matrices for the KNN model. The precision,
recall, and F1-score for benign predictions were 0.96, 0.98, and 0.97 respectively, while those
for malignant predictions were 0.97, 0.94, and 0.95, indicating slightly stronger performance on
benign cases.
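The per-class metrics follow directly from the confusion-matrix counts reported above, as the short calculation below shows. The helper name `prf` is introduced here only for illustration.

```python
# Counts reported above for KNN with PCA on the WDBC test set.
benign_correct, benign_total = 106, 108
malignant_correct, malignant_total = 59, 63

benign_as_malignant = benign_total - benign_correct          # 2
malignant_as_benign = malignant_total - malignant_correct    # 4


def prf(correct, predicted_as_class, actual_in_class):
    """Return precision, recall and F1 for one class."""
    precision = correct / predicted_as_class
    recall = correct / actual_in_class
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


benign = prf(benign_correct, benign_correct + malignant_as_benign, benign_total)
malignant = prf(malignant_correct, malignant_correct + benign_as_malignant, malignant_total)
accuracy = (benign_correct + malignant_correct) / (benign_total + malignant_total)

print([round(v, 2) for v in benign])      # [0.96, 0.98, 0.97]
print([round(v, 2) for v in malignant])   # [0.97, 0.94, 0.95]
print(round(100 * accuracy, 2))           # 96.49
```

The lower malignant recall (59 of 63) is the figure of most clinical interest, since it counts malignant cases that the model labelled benign.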

Figure 3. Confusion Matrix for KNN on BCCD Dataset

Figure 4. Confusion Matrix for KNN with PCA on WDBC Dataset

5.1.2. Support Vector Machine

In this study, we examined SVMs with either a linear or a polynomial kernel. For the BCCD
dataset, the best parameters identified were: C=100, degree=2, gamma='scale', and
kernel='linear' (the degree setting only affects the polynomial kernel). The model achieved an
accuracy of 79.17%, with a confusion matrix indicating 11 true positives, 1 false positive, 4
false negatives, and 8 true negatives. The classification report showed a precision of 0.73 and
0.89, recall of 0.92 and 0.67, and F1-scores of 0.81 and 0.76 for Class 1 and Class 2,
respectively. After eliminating the MCP-1 feature, the least important feature in the BCCD
dataset, the accuracy of the SVM algorithm rose to 87.5%.

On the WDBC dataset, a basic SVM model initially achieved an accuracy of 96%. After
hyperparameter tuning, the SVM model reached an accuracy of 98.24%. With PCA, the optimal
hyperparameters were C=0.1, degree=2, gamma='scale', and kernel='linear', and the resulting model
reached a higher accuracy of 99.42%. The confusion matrix displayed 108 true positives, 0 false
positives, 1 false negative, and 62 true negatives. The classification report showed a precision
of 0.99 and 1.00, recall of 1.00 and 0.98, and F1-scores of 1.00 and 0.99 for the benign and
malignant classes, respectively, as portrayed in Figure 7. Figures 6 and 7 provide comparative
views of SVM performance across different kernels for the WDBC and BCCD datasets. Finally, Figures
8 and 9 show the confusion matrices for the SVM models on the WDBC and BCCD datasets respectively,
providing a detailed overview of the prediction capabilities of the models.
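Combining PCA with hyperparameter tuning can be sketched as a single pipeline searched with `GridSearchCV`. The retained-variance threshold of 95%, the candidate values of C and degree, and the 5-fold setting are assumptions for illustration, and scikit-learn's bundled WDBC copy is used, so the selected parameters and accuracy need not match those reported above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    # Assumption: keep enough components to explain 95% of the variance.
    ("pca", PCA(n_components=0.95, svd_solver="full")),
    ("svm", SVC()),
])
param_grid = {
    "svm__kernel": ["linear", "poly"],
    "svm__C": [0.1, 1, 10, 100],
    "svm__degree": [2, 3],             # only used by the polynomial kernel
    "svm__gamma": ["scale"],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy").fit(X_tr, y_tr)

n_components = search.best_estimator_.named_steps["pca"].n_components_
test_accuracy = search.score(X_te, y_te)
print(search.best_params_, n_components)
print("test accuracy:", round(test_accuracy, 4))
```

Placing the scaler and PCA inside the pipeline means both are refitted on the training part of every fold, so the cross-validated scores are not inflated by information from the validation fold.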

Figure 6. WDBC dataset with PCA: performance comparison of different SVM kernels

Figure 7. BCCD dataset: performance comparison of different SVM kernels

Figure 7. Confusion Matrix for SVM on WDBC Dataset



Figure 8. Confusion Matrix for SVM on BCCD Dataset

5.1.3. Logistic Regression

The performance of the LR model with the chosen hyperparameters on the BCCD dataset was evaluated
using accuracy, the confusion matrix, and the classification report. The model achieved an
accuracy of 91.67% on the test set. The confusion matrix showed 12 TP and 10 TN predictions, with
0 FP and 2 FN predictions. The classification report showed a precision, recall, and F1-score of
0.92 for both class 1 and class 2. The ROC curve was plotted to visualize the trade-off between
TPR and FPR at various decision thresholds, and it demonstrated a satisfactory level of
discrimination between the two classes. The AUC score was 0.9444, indicating that the LR model
separates the two classes well. The model was then retrained after removing the 'MCP.1' feature,
which was identified as the least important. Surprisingly, the performance of the model
deteriorated slightly, yielding an accuracy of 87.5%. Figure 9 displays the confusion matrix for
the LR model on the BCCD dataset, highlighting the model's ability to identify true positives and
true negatives.

A basic LR model was then evaluated on the WDBC dataset, resulting in a test accuracy of 98.25%.
Hyperparameter tuning raised the accuracy to 99.42%, the same accuracy obtained when PCA was
added. The confusion matrix showed 108 true negatives (TN), 62 true positives (TP), 1 false
negative (FN), and no false positives (FP). This result indicates that the model performed
exceptionally well in distinguishing benign from malignant cases in the WDBC dataset. With one
false negative and no false positives, the classification report gives a precision of 0.99 for
class 0 (non-cancerous) and 1.00 for class 1 (cancerous), and a recall of 1.00 for class 0 and
0.98 for class 1. The corresponding F1-scores were 1.00 for class 0 and 0.99 for class 1. The
macro and weighted averages of precision, recall, and F1-score are all around 0.99. This suggests
an exceptional performance and implies a high degree of reliability in the model's predictive
capability.

Table 6. LR Model Performance with different Hyperparameters

Hyperparameter Accuracy

C: 0.001, Penalty: L1, Solver: liblinear 62.56%

C: 0.001, Penalty: L2, Solver: liblinear 94.22%

C: 0.1, Penalty: L2, Solver: liblinear 97.48%

The table above reports validation-set accuracy on the WDBC dataset. The combination of C=0.001,
Penalty=L1, and Solver=liblinear achieved the lowest accuracy of the three, 62.56%. This is likely
due to the strong regularization (represented by the small 'C' value) combined with the L1
penalty, which drives coefficients to exactly zero and can leave the model too sparse to fit the
data, leading to underfitting. The combination of C=0.001, Penalty=L2, and Solver=liblinear
yielded a significantly higher accuracy of 94.22%. The L2 penalty, unlike the L1 penalty, shrinks
coefficients without making the model sparse, which might explain the improved performance
despite the strong regularization. Lastly, the combination of C=0.1, Penalty=L2, and
Solver=liblinear provided the highest accuracy of 97.48%. The higher 'C' value means less
stringent regularization, which might have permitted the model to identify more intricate
patterns in the data, thus resulting in improved accuracy. Figure 10 presents the confusion matrix
for the LR model on the WDBC dataset, illustrating the performance of the model in distinguishing
between benign and malignant cancer cases.
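The effect of the penalty type and of 'C' on sparsity can be examined with the sketch below, which evaluates the three combinations of Table 6 and counts the coefficients that remain non-zero. It assumes scikit-learn's bundled WDBC copy and 5-fold cross-validation, so the accuracies it prints are not expected to match Table 6 exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

combos = [(0.001, "l1"), (0.001, "l2"), (0.1, "l2")]   # the three rows of Table 6
scores, nonzero = {}, {}
for C, penalty in combos:
    lr = LogisticRegression(C=C, penalty=penalty, solver="liblinear")
    model = make_pipeline(StandardScaler(), lr)
    scores[(C, penalty)] = cross_val_score(model, X, y, cv=5).mean()
    # Refit once on all data to count coefficients that survive the penalty.
    model.fit(X, y)
    nonzero[(C, penalty)] = int((lr.coef_ != 0).sum())

for key in combos:
    print(key, f"accuracy={scores[key]:.4f}", f"nonzero coefficients={nonzero[key]}")
```

If the L1 model with C=0.001 retains few or no non-zero coefficients, its predictions carry little information about the features, which is consistent with the underfitting explanation given above.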

Figure 9. Confusion Matrix for LR on BCCD Dataset

Figure 10. Confusion Matrix for LR on WDBC Dataset



5.1.4. Decision tree

For the BCCD dataset, the optimized DT model achieved an overall accuracy of 75% on the test data.
The confusion matrix showed that, out of 24 test samples, the model correctly classified 18 and
misclassified 6, with errors balanced between the two classes. Precision and recall for both
classes were also 75%, in line with the overall accuracy. The F1-score, which is the harmonic mean
of precision and recall, was likewise 75% for both classes, demonstrating that the classification
performance was well-balanced. Figure 11 shows the confusion matrix for the DT model applied to
the BCCD dataset.

Table 7. DT Model Performance at Various Max Depths on BCCD Dataset

Max depth Accuracy

1 75%

2 66%

3 71%

These results highlight the importance of hyperparameter tuning in ML models. While it might seem
that increasing the complexity of a model (in this case, increasing the max depth) would lead
to better performance, this is not always the case. In fact, models with too much complexity
can suffer from overfitting, leading to poorer performance on unseen data. After eliminating the
MCP-1 feature, which was the least important feature, the accuracy of the algorithm rose to
83.33%.

In our study utilizing the WDBC dataset, the basic DT achieved an accuracy of 94.72%, and
hyperparameter tuning alone left the accuracy unchanged. Applying PCA to the WDBC dataset
reduced its complexity: PCA transformed the original dataset, comprising 30 features, into a
smaller set of features known as principal components, which accounted for the vast majority of
the variability present in the original dataset. As a consequence, the complexity of the DT model
was notably diminished, which improved its performance on the reduced feature set. When PCA was
employed in conjunction with hyperparameter tuning, the model's accuracy improved to 97.37% on
the test dataset, meaning the model correctly predicted whether a breast mass was malignant or
benign in approximately 97.37% of the test cases.

The confusion matrix shows that the model correctly classified 69 benign instances as benign (TN)
and 42 malignant instances as malignant (TP). However, it incorrectly classified 2 benign
instances as malignant (FP) and 1 malignant instance as benign (FN). For the benign (0) class,
precision, recall, and F1-score were 0.99, 0.97, and 0.98 respectively, while the malignant (1)
class scored 0.95, 0.98, and 0.97. These metrics point towards a high-performing model,
especially for benign cases, with slightly lower performance for malignant cases.

Further exploration of the 'max_depth' hyperparameter showed that at a max_depth of 1 the model
achieved an accuracy of 96.49%. Increasing the max_depth to 2 caused a slight decrease in accuracy
to 95.61%, while a max_depth of 3 raised the accuracy to 97.37%. This indicates that a max_depth
of 3 enables the model to better capture the patterns in the data without overfitting, leading to
improved performance on the test data. Figure 12 presents the confusion matrix for the DT model
on the WDBC dataset. Despite misclassifying a few instances, the model correctly predicts a
large majority of both benign and malignant cases.

Table 8. DT Model Accuracy at Various Max Depths on WDBC Dataset

Max depth Accuracy

1 96.49%

2 95.61%

3 97.37%
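A depth sweep of this kind can be sketched as follows. The sketch omits PCA, adds an unrestricted tree for contrast, and assumes scikit-learn's bundled WDBC copy with an illustrative split seed, so its accuracies are not expected to match Table 8.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

depth_scores = {}
for depth in (1, 2, 3, None):                      # None lets the tree grow fully
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    depth_scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(depth, f"train={depth_scores[depth][0]:.4f}", f"test={depth_scores[depth][1]:.4f}")
```

Comparing training and test accuracy at each depth shows where additional depth stops helping: a growing gap between the two is the usual sign of overfitting.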

Figure 11. Confusion Matrix for DT on BCCD Dataset

Figure 12. Confusion Matrix for DT on WDBC Dataset



5.2. Classifier Comparative analysis

On the BCCD dataset, all models performed moderately well. After removing the least important
feature, the accuracy of every model except LR increased markedly, while the accuracy of LR fell.
KNN improved from 87.50% to 95.83%, and the accuracy of the SVM and DT models increased from
79.17% to 87.5% and from 75% to 83.33%, respectively. With all features, LR was the best model at
91.67% accuracy; once the least important feature was removed, KNN performed best with 95.83%.
The results for the BCCD dataset are presented in Table 9.

The results from the analysis of the WDBC dataset suggest that all four ML models (KNN, SVM,
LR, and DT) performed exceptionally well in terms of their predictive accuracies. However,
after hyperparameter tuning and PCA, it was evident that some models outperformed the
others. Both the SVM and LR algorithms showed notable performance improvements, reaching
an accuracy of 99.42%. The accuracy of the DT algorithm also improved significantly after the
incorporation of PCA, attaining an accuracy of 97.37%. Meanwhile, KNN showed the least
improvement, yet maintained a commendable accuracy rate of 96.49%.

Moreover, the recall, precision, and F1-score also witnessed improvement with the fine-tuning
of these models, indicating their ability to deliver reliable results while minimizing errors. It
was also observed that adjusting hyperparameters such as 'C' and 'max_depth' and
implementing feature reduction techniques like PCA significantly boosted the predictive power
of the models. The detailed performance of each model provides essential insights into the
factors that contribute to an accurate prediction of BC diagnoses. These findings underscore
the potential of ML algorithms to aid in the early detection and diagnosis of BC, thereby
enhancing patient care and outcomes.

Table 9. Comparison of Machine Learning Model Accuracies on BCCD Dataset Before and
After Feature Elimination

ML models   Accuracy   Accuracy After Feature Elimination
KNN         87.50%     95.83%
SVM         79.17%     87.5%
DT          75%        83.33%
LR          91.67%     87.5%

Table 10. Model Performance: Assessing Accuracy Across Machine Learning Models on
WDBC Dataset

ML models   Accuracy (Basic Model)   Accuracy (Hyperparameter Tuning)   Accuracy (PCA & Hyperparameter Tuning)
KNN         95.61%                   96%                                96.49%
SVM         96%                      98.24%                             99.42%
DT          94.72%                   94.72%                             97.37%
LR          98.25%                   99.42%                             99.42%

5.3. Comparison with present literature

In the literature, Austria et al. (2019) applied seven ML models to the BCCD dataset and obtained
a maximum accuracy of 74% with gradient boosting, while other algorithms such as KNN reached
58.14% and the remaining models fell in a similar range. These lower accuracies are attributed to
the data not being standardized. Our research used standardized data, and the accuracy of KNN
increased to 87.50%. LR reached 91.67%, compared with 72.14% in the previous study, and SVM and DT
also performed better than in the study by Austria et al. Sharma et al. (2018) achieved an
accuracy of 95.90% for the KNN algorithm, similar to our basic KNN on the WDBC dataset (95.61%);
with PCA and hyperparameter tuning, our KNN reached 96.49%. However, Sharma et al. applied a split
ratio of 70:30, whereas our results were obtained with an 80:20 split. With RF and naive Bayes,
the same authors reported accuracies of 94%-95%.

Faud (2018) applied techniques with and without PCA to the WDBC dataset, and the greatest
accuracy attained by LR was 97.68%. Our LR model performed among the best, with 99.42% accuracy
and better results on other measures such as precision, recall, and F1-score. In contrast to the
models tested here, the accuracy of most ML models in Faud (2018) declined with PCA; LR
accuracy, for example, decreased to 96.54%. Our models with PCA also outperform Faud's models
without PCA, which is attributed to the hyperparameter tuning applied to our models.

Hazra et al. (2020) applied the DT and ANN algorithms to the WDBC dataset and achieved a
maximum accuracy of 98% with ANN. Their DT model reached an accuracy of 86% without
eliminating highly correlated features; after eliminating them, the accuracy increased to
96%. In comparison to the Hazra study, our research used PCA with DT and attained a higher
accuracy of 97.37%.
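
Eliminating highly correlated features can be sketched as follows. This is a minimal
illustration under stated assumptions: the 0.9 threshold, the greedy keep-first rule, the
function names, and the toy feature values are hypothetical and are not taken from Hazra et
al. or from this study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(features, threshold=0.9):
    """Greedily keep a feature only if |r| with every kept feature is <= threshold."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy columns; the names echo WDBC-style features for readability only.
features = {
    "radius": [10.0, 12.0, 14.0, 16.0, 18.0],
    "perimeter": [63.0, 75.5, 88.0, 100.4, 113.0],  # nearly proportional to radius
    "smoothness": [0.11, 0.08, 0.12, 0.07, 0.10],
}

selected = drop_correlated(features)
print(selected)
```

With these toy values the perimeter column is dropped because it is almost perfectly
correlated with radius, while smoothness is kept. The order in which features are visited
decides which member of a correlated pair survives, so a real analysis would need a
principled rule for that choice.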

Comparing the research of Durgalakshmi and Vijayakumar (2015) with the present study brings
several important distinctions and similarities to the fore. Both studies used machine
learning algorithms to predict breast cancer diagnosis, yet the choice and usage of datasets
differed significantly. Durgalakshmi and Vijayakumar used the WDBC dataset and reported that
the highest accuracy of 94% was obtained using selected features in the Naïve Bayes (NB) and
Random Forest (RF) models. When selected features were used with the SVM model, the accuracy
obtained was 81%.

Table 11. Comparative Analysis of Machine Learning Model Performance with Previous
Studies

Reference                           Dataset  Methodology (previous study)  Model    Accuracy (previous study)  Accuracy (proposed method)
Hazra et al., 2020                  WDBC     PCA transformation            DT       96%                        97.37%
Austria et al., 2019                BCCD     Standard ML model             KNN      58.14%                     87.50%
Austria et al., 2019                BCCD     Standard ML model             LR       72.14%                     91.67%
Sharma et al., 2018                 WDBC     Standard ML model             KNN      97.68%                     99.42%
Fuad, 2018                          WDBC     ML model with PCA             LR       96.54%                     99.42%
Fuad, 2018                          WDBC     Without PCA                   SVM      97.05%                     99.42%
Durgalakshmi and Vijayakumar, 2015  WDBC     J48 DT, SVM, KNN, RF, DT      SVM, DT  SVM: 73%; DT: 86.73%       -

6. Conclusion and Discussion

Breast cancer represents a leading cause of mortality worldwide. Early detection stands as a
crucial element in mitigating this death rate, accentuating the need for efficient and reliable
diagnostic tools. The advent of machine learning techniques offers immense promise in this
regard, paving the way for significant advancements in the early detection of breast cancer.
The study explored the use of ML techniques for BC diagnosis with two distinct datasets:
BCCD and WDBC. The models utilized included KNN, SVM, LR, and DT classifiers, which
were refined through hyperparameter tuning and feature selection to enhance performance.

In the BCCD dataset, the KNN model showed an accuracy of 87.5% when using the Euclidean
distance measure. Interestingly, the accuracy increased to 95.83% upon removal of the least
impactful feature (MCP.1). Similarly, the performance of both SVM and DT models improved
with the removal of this feature. The LR model initially achieved an accuracy of 91.67%, but
its performance decreased after this feature removal.

In the WDBC dataset, the LR model performed most effectively, achieving an accuracy of 99.42%
with optimal hyperparameters. The DT model's performance improved noticeably when combined
with PCA, with accuracy increasing from 94.72% to 97.37% at a max_depth of 3. The SVM model
initially achieved an accuracy of 96%, which increased to 98.24% after hyperparameter tuning.
Notably, combining PCA with hyperparameter tuning further improved the SVM model's
accuracy to 99.42%.
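
The combination of standardization, PCA, and hyperparameter tuning can be sketched with
scikit-learn. This is an illustrative sketch, not the code used in the thesis. It relies on
the copy of the WDBC data bundled with scikit-learn, and the parameter grid, the candidate
numbers of components, the stratified split, the random seed, and the 5-fold cross-validation
are all assumptions, so the resulting accuracy will generally differ from the figures
reported here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# WDBC as shipped with scikit-learn (569 samples, 30 features).
X, y = load_breast_cancer(return_X_y=True)

# 80:20 split; stratification and the seed are assumptions made for this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling and PCA live inside the pipeline so they are fitted on training folds only.
pipeline = Pipeline(
    [
        ("scale", StandardScaler()),
        ("pca", PCA()),
        ("svm", SVC()),
    ]
)

# Hypothetical search grid, chosen only to keep the example small.
param_grid = {
    "pca__n_components": [5, 10, 15],
    "svm__C": [0.1, 1, 10],
    "svm__kernel": ["linear", "rbf"],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# Accuracy of the best configuration on the held-out 20%.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 4))
```

Keeping the scaler and PCA inside the pipeline means the cross-validation folds never see
statistics computed from their own validation data, which helps avoid optimistic accuracy
estimates during tuning.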

6.1. Answering the research questions

1. How and which machine learning models are utilized to detect cancer in patient data
according to literature?

A significant number of ML models have been deployed for the detection of BC in patient
data, as highlighted in the literature review above. Different algorithms have shown varying
degrees of success and applicability based on the specific datasets used and the nature of the
pre-processing, feature selection, and optimization techniques applied. In particular, LR and
KNN models have shown impressive results in studies such as the one conducted by Al-Azzam
and Shatnawi (2021), using the WDBC dataset. These models achieved accuracy rates ranging
from 97% to 98% in both supervised and semi-supervised learning settings. SVM models,
renowned for their versatility and robustness, have also been extensively employed in breast
cancer detection. Osman (2017) utilized an SVM model with a two-step clustering method to
attain an impressive accuracy of 99.10% on the WBC dataset. Meanwhile, Durgalakshmi and
Vijayakumar (2015) improved the accuracy of the DT algorithm to 86.34% by using an
optimization function-based SVM. More complex models such as Random Forest, Gradient
Boosting, Naive Bayes, Convolutional Neural Network, and Artificial Neural Network have
also been examined. In the study by Sharma et al. (2018), RF, KNN, and Naive Bayes
algorithms all achieved accuracy over 94% on the WDBC dataset. On the other hand, Austria
et al. (2019) found Gradient Boosting to be the most accurate (74.14%) among seven ML
models on the BCCD dataset.

2. Among SVM, KNN, DT, and LR, which ML model demonstrates the best prediction
performance when applied to the BCCD and WDBC dataset?

Based on the results of this study, different models showed the best performance on the two
datasets. For the BCCD dataset, after removing the least important feature, KNN
demonstrated the best predictive accuracy of 95.83%. For the WDBC dataset, both SVM and LR
showed an outstanding performance with an accuracy of 99.42% after hyperparameter tuning
and the implementation of PCA.

6.2. Verification and Validation

This study, while providing significant insights, is not without its limitations. One such
limitation pertains to the size and nature of the datasets used. These factors may limit the
scope and generalizability of the findings: on a larger and more diverse dataset that
includes outliers, the models would not necessarily achieve equally high accuracy. This
raises the question of how well our findings might extend to other populations or scenarios,
and underscores the need for future research to replicate and extend these results with more
expansive and varied datasets.

In this study, machine learning models were employed to diagnose cancer with commendable
accuracy. The study focused on four main machine learning models: KNN, SVM, LR, and DT.
However, there are many more advanced models that could potentially
improve predictive performance.
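
How strongly a small test set limits the conclusions can be made concrete with a confidence
interval for the measured accuracy. The sketch below computes a Wilson score interval. The
counts (23 correct out of 24 test samples) and the 95% level are hypothetical values chosen
for illustration and are not stated in this section.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


# Hypothetical counts: 23 of 24 test samples classified correctly.
low, high = wilson_interval(23, 24)
print(f"accuracy = {23 / 24:.4f}, 95% interval = [{low:.3f}, {high:.3f}]")
```

Under these assumptions the interval spans roughly 80% to 99%, which illustrates why
accuracies measured on a few dozen samples should be compared with caution.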

Moreover, although the study has made great strides in maximizing accuracy, it's important to
acknowledge that in the realm of medical diagnosis, accuracy is not the only performance
metric that matters. False negatives often have more detrimental effects than false positives,
highlighting the need to consider other performance metrics. This broader view on performance
measures would provide a more holistic evaluation of the effectiveness of ML models in
diagnosis. This research has made significant strides in the application of ML models for
diagnosis, while also identifying important areas for future study.
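
Metrics beyond accuracy, such as precision, recall, and F1-score, can be computed directly
from a confusion matrix. The counts in this sketch are hypothetical and do not come from the
thesis results. They are chosen so that accuracy looks high while several malignant cases are
still missed.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall (sensitivity), and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical test set of 114 samples with "malignant" as the positive class:
# 36 malignant detected, 6 malignant missed, 2 false alarms, 70 benign cleared.
metrics = classification_metrics(tp=36, fp=2, fn=6, tn=70)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is about 93% while the recall is about 86%: six of the 42
malignant cases are missed, which is exactly the kind of error that the accuracy figure alone
hides.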

The objective of employing machine learning models to diagnose cancer with the best possible
level of accuracy was accomplished. However, the study was less successful in determining
the specific reasons for the presence of malignant or benign traits.
This requires the assistance of an expert in the relevant field.

6.3. Future research

While the research has provided valuable insights using four specific machine learning
algorithms, there is an opportunity to expand the scope of future studies to explore other
potential algorithms. These could include, but are not limited to, Random Forests, Neural
Networks, or ensemble methods. Broadening the range of algorithms studied could lead to the
discovery of more effective prediction models, further improving the diagnosis of cancer.

Runtime is a critical consideration for ML models; understanding the minimum and maximum
execution times for each model is crucial. Many machine learning models don't indicate the
level of uncertainty associated with their predictions, an aspect that could be vital in medical
diagnostics. Future research could therefore focus on developing models that quantify this
uncertainty in addition to making predictions. The importance of the feature extraction process
in building effective machine learning models is a well-acknowledged facet of data-driven
research. While the current thesis primarily focuses on employing existing feature sets,
innovative approaches to feature extraction could be a significant area for future research.
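
Runtime and prediction uncertainty can both be explored with standard tooling. The sketch
below is illustrative and is not part of the thesis experiments. It times the training of a
logistic regression model on scikit-learn's copy of the WDBC data and uses the predicted
class probabilities as a simple proxy for prediction uncertainty. The 0.3 to 0.7 "uncertain"
band, the stratified split, the seed, and the variable names are assumptions.

```python
import time

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Wall-clock training time; repeat the measurement to estimate minimum and maximum.
start = time.perf_counter()
model.fit(X_train, y_train)
fit_seconds = time.perf_counter() - start

# Column 1 holds the probability of class 1 ("benign" in scikit-learn's encoding).
proba_benign = model.predict_proba(X_test)[:, 1]

# Flag predictions whose probability falls in a hypothetical "uncertain" band.
uncertain = (proba_benign > 0.3) & (proba_benign < 0.7)
print(f"fit time: {fit_seconds:.4f} s, uncertain cases: {uncertain.sum()} of {len(y_test)}")
```

Predicted probabilities are only a rough indication of uncertainty. They are not guaranteed
to be well calibrated, so a study that relies on them would likely need a separate
calibration check.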

The datasets used in this study come from the United States (WDBC, Wisconsin) and Portugal
(BCCD, Coimbra), which raises the issue of data representation and the applicability of the
results across different ethnic groups. Future research could address this by
incorporating a more diverse range of datasets, including those from Asian populations. This
would enable the development of models that can accurately diagnose cancer in a wider range
of patients, enhancing the generalizability and cross-cultural applicability of the results.

By exploring these aspects, future research can contribute valuable insights into the
practicalities of integrating machine learning models into clinical workflows, highlighting
potential benefits and challenges, and ultimately guiding the way towards more effective,
efficient, and personalized healthcare.

References

Abdullah Al-Dhabi, N., Srigopalram, S., Ilavenil, S., Kim, Y.O., Agastian, P., Baaru, R.,
Balamurugan, K., Choi, K.C. and Valan Arasu, M., 2016. Proteomic analysis of stage-II breast
cancer from formalin-fixed paraffin-embedded tissues. BioMed research international, 2016

Al-Azzam, N., & Shatnawi, I. (2021). Comparing supervised and semi-supervised Machine
Learning Models on Diagnosing Breast Cancer. Annals of medicine and surgery (2012), 62,
53–64

Alom, M.Z., Yakopcic, C., Nasrin, M.S., Taha, T.M. and Asari, V.K., 2019. Breast cancer
classification from histopathological images with inception recurrent residual convolutional
neural network. Journal of digital imaging, 32, pp.605-617

Altman, N. S. (1992). An introduction to kernel and nearest-neighbor nonparametric
regression. The American Statistician, 46(3), 175-185

Amethiya, Y., Pipariya, P., Patel, S. and Shah, M., 2022. Comparative analysis of breast
cancer detection using machine learning and biosensors. Intelligent Medicine, 2(2), pp.69-81

Antropova, Huynh, B. Q., & Giger, M. L. (2017). A deep feature fusion methodology for breast
cancer diagnosis demonstrated on three imaging modality datasets. Medical Physics
(Lancaster), 44(10), 5162–5171. https://doi.org/10.1002/mp.12453

Atrey, A., Narayan, N., Vijh, S. and Kumar, S., 2022, January. Analysis of Breast Cancer using
Machine Learning Methods. In 2022 12th International Conference on Cloud Computing, Data
Science & Engineering (Confluence) (pp. 258-261). IEEE

Bell, J., 2022. What is machine learning?. Machine Learning and the City: Applications in
Architecture and Urban Design, pp.207-216

Bhise, S., Gadekar, S., Gaur, A.S., Bepari, S. and Deepmala Kale, D.S.A., 2021. Breast cancer
detection using machine learning techniques. Int. J. Eng. Res. Technol, 10(7)

Bicchierai, G., Di Naro, F., De Benedetto, D., Cozzi, D., Pradella, S., Miele, V. and Nori, J.,
2021. A review of breast imaging for timely diagnosis of disease. International Journal of
Environmental Research and Public Health, 18(11), p.5509

Cover, T. M., & Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Transactions
on Information Theory, 13(1), 21-27

De Blok, C. J. M., Wiepjes, C. M., Nota, N. M., van Engelen, K., Adank, M. A., Dreijerink, K.
M. A., Barbé, E., Konings, I. R. H. M., & den Heijer, M. (2019). Breast cancer risk in
transgender people receiving hormone treatment: nationwide cohort study in the Netherlands.
BMJ (Clinical research ed.), 365, l1652. https://doi.org/10.1136/bmj.l1652

De Brito, P. M. (2018). Predicting the occurrence of breast cancer using insulin-related
biomarkers, independently of obesity. Tilburg: Tilburg University

Dhungel, Carneiro, G., & Bradley, A. P. (2017). Fully automated classification of
mammograms using deep residual neural networks. 2017 IEEE 14th International Symposium
on Biomedical Imaging (ISBI 2017), 310–314. https://doi.org/10.1109/ISBI.2017.7950526

Dong, W., Bensken, W.P., Kim, U., Rose, J., Berger, N.A. and Koroukian, S.M., 2022.
Phenotype discovery and geographic disparities of late-stage breast cancer diagnosis across US
Counties: a machine learning approach. Cancer Epidemiology, Biomarkers &
Prevention, 31(1), pp.66-76

Durgalakshmi, B., & Vijayakumar, V. (2015). Progonosis and modelling of breast cancer and
its growth novel naïve bayes. Procedia Computer Science, 50, 551-553

Edge, S. B., & Compton, C. C. (2010). The American Joint Committee on Cancer: the 7th
edition of the AJCC cancer staging manual and the future of TNM. Annals of Surgical
Oncology, 17(6), 1471-147

El Chamieh, C., Vielh, P. and Chevret, S., 2022. Statistical methods for evaluating the fine
needle aspiration cytology procedure in breast cancer diagnosis. BMC Medical Research
Methodology, 22(1), p.40

Elanany, M.A., Osman, E.E.A., Gedawy, E.M. and Abou-Seri, S.M., 2023. Design and
synthesis of novel cytotoxic fluoroquinolone analogs through topoisomerase inhibition, cell
cycle arrest, and apoptosis. Scientific Reports, 13(1), p.4144

Esteva, Kuprel, B., Novoa, R. A., Ko, J., Swetter, S. M., Blau, H. M., & Thrun, S. (2017).
Dermatologist-level classification of skin cancer with deep neural networks. Nature (London),
542(7639), 115–118. https://doi.org/10.1038/nature21056

Ferlay, J., Ervik, M., Lam, F., Colombet, M., Mery, L., Piñeros, M., Znaor, A., Soerjomataram,
I. and Bray, F., 2020. Global cancer observatory: cancer today. International Agency for
Research on Cancer, Lyon

Fuad, W. M. (2018). Early detection of breast cancer using machine learning (Doctoral
dissertation, Brac University)

Glielmo, A., Husic, B.E., Rodriguez, A., Clementi, C., Noé, F. and Laio, A., 2021.
Unsupervised learning methods for molecular simulation data. Chemical Reviews, 121(16),
pp.9722-9758

Gonçalves, C.B., Souza, J.R. and Fernandes, H., 2022. CNN architecture optimization using
bio-inspired algorithms for breast cancer detection in infrared images. Computers in Biology
and Medicine, 142, p.105205

Guha, A., Fradley, M.G., Dent, S.F., Weintraub, N.L., Lustberg, M.B., Alonso, A. and
Addison, D., 2022. Incidence, risk factors, and mortality of atrial fibrillation in breast cancer:
a SEER-Medicare analysis. European heart journal, 43(4), pp.300-312

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: data
mining, inference, and prediction. Springer Science & Business Media

Hastie, Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data
Mining, Inference, and Prediction (2nd ed.). Springer New York.
https://doi.org/10.1007/978-0-387-84858-7

Hosmer Jr, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied logistic regression (Vol.
398). John Wiley & Sons

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical
learning. Springer

James, Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning:
With Applications in R. Springer.

Jia, K.Y., Menes, T.S., Bernstein-Molho, R., Nissan, A. and Zippel, D., 2023. Characterization
of patients with a diagnosis of breast cancer and melanoma: genetic susceptibility or increased
surveillance?. European Journal of Cancer Prevention: the Official Journal of the European
Cancer Prevention Organisation (ECP).

Kim, J., Kim, J.Y., Lee, H.B., Lee, Y.J., Seong, M.K., Paik, N., Park, W.C., Park, S., Jung,
S.P., Bae, S.Y. and Korean Breast Cancer Society, 2020. Characteristics and prognosis of 17
special histologic subtypes of invasive breast cancers according to World Health Organization
classification: comparative analysis to invasive carcinoma of no special type. Breast Cancer
Research and Treatment, 184, pp.527-542

Krishnan, M. M. R., Banerjee, S., Chakraborty, C., Chakraborty, C., & Ray, A. K. (2010).
Statistical analysis of mammographic features and its classification using support vector
machine. Expert Systems with Applications, 37(1), 470-478

Leão, D.C.M.R., Pereira, E.R., Pérez-Marfil, M.N., Silva, R.M.C.R.A., Mendonça, A.B.,
Rocha, R.C.N.P. and García-Caro, M.P., 2021. The importance of spirituality for women facing
breast cancer diagnosis: a qualitative study. International journal of environmental research and
public health, 18(12), p.6415

Łukasiewicz, S., Czeczelewski, M., Forma, A., Baj, J., Sitarz, R., & Stanisławek, A. (2021).
Breast Cancer-Epidemiology, Risk Factors, Classification, Prognostic Markers, and Current
Treatment Strategies-An Updated Review. Cancers, 13(17), 4287.
https://doi.org/10.3390/cancers13174287

Mahmood, F.M., 2023. A comparison of rural and urbans women's knowledge and attitudes
toward breast cancer. Journal of Population Therapeutics and Clinical Pharmacology, 30(3),
pp.515-521

McWilliam, A., Faivre-Finn, C., Kennedy, J., Kershaw, L. and Van Herk, M.B., 2016. Data
mining identifies the base of the heart as a dose-sensitive region affecting survival in lung
cancer patients. International Journal of Radiation Oncology, Biology, Physics, 96(2), pp. S48-
S49

Mushtaq, Z., Yaqub, A., Sani, S., & Khalid, A. (2020). Effective K-nearest neighbor
classifications for Wisconsin breast cancer data sets. Journal of the Chinese Institute of
Engineers, 43(1), 80-92

Naji, M.A., El Filali, S., Aarika, K., Benlahmar, E.H., Abdelouhahid, R.A. and Debauche, O.,
2021. Machine learning algorithms for breast cancer prediction and diagnosis. Procedia
Computer Science, 191, pp.487-492

Osman, A.H., 2017. An enhanced breast cancer diagnosis scheme based on two-step-SVM
technique. International Journal of Advanced Computer Science and Applications, 8(4)

Park, E.Y., Yi, M., Kim, H.S. and Kim, H., 2021. A decision tree model for breast
reconstruction of women with breast cancer: a mixed method approach. International Journal
of Environmental Research and Public Health, 18(7), p.3579

Park, K.H., Batbaatar, E., Piao, Y., Theera-Umpon, N. and Ryu, K.H., 2021. Deep learning
feature extraction approach for hematopoietic cancer subtype classification. International
Journal of Environmental Research and Public Health, 18(4), p.2197

Patrício, M., Pereira, J., Crisóstomo, J., Matafome, P., Seiça, R., Caramelo, F., & Gomes, M.
(2018). Breast Cancer Coimbra Dataset [Data file]. Faculty of Medicine of the University of
Coimbra and University Hospital Centre of Coimbra. UCI Machine Learning Repository.
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Coimbra

Pise, N.N. and Kulkarni, P., 2008, December. A survey of semi-supervised learning methods.
In 2008 International conference on computational intelligence and security (Vol. 2, pp. 30-
34). IEEE

Quinlan, J. R. (1986). Induction of decision trees. Machine learning, 1, 81-106

Rabiei, R., Ayyoubzadeh, S.M., Sohrabei, S., Esmaeili, M. and Atashi, A., 2022. Prediction of
breast cancer using machine learning approaches. Journal of Biomedical Physics &
Engineering, 12(3), p.297

Rasool, A., Tao, R., Kashif, K., Khan, W., Agbedanu, P. and Choudhry, N., 2020, February.
Statistic Solution for Machine Learning to Analyze Heart Disease Data. In Proceedings of the
2020 12th International Conference on Machine Learning and Computing (pp. 134-139)

Ross, Slodkowska, E. A., Symmans, W. F., Pusztai, L., Ravdin, P. M., & Hortobagyi, G. N.
(2009). The HER‐2 Receptor and Breast Cancer: Ten Years of Targeted Anti–HER‐2 Therapy
and Personalized Medicine. The Oncologist (Dayton, Ohio), 14(4), 320–368.
https://doi.org/10.1634/theoncologist.2008-0230

Saito, T., & Rehmsmeier, M. (2015). The precision-recall plot is more informative than the
ROC plot when evaluating binary classifiers on imbalanced datasets. PloS one, 10(3),
e0118432

Sannasi Chakravarthy, S. R., N. Bharanidharan, and Harikumar Rajaguru. "Multi-deep CNN
based experimentations for early diagnosis of breast cancer." IETE Journal of Research (2022):
1-16

Schölkopf, & Smola, A. J. (2001). Learning with Kernels: Support Vector Machines,
Regularization, Optimization, and Beyond. MIT Press

Sharma, R., 2021. Global, regional, national burden of breast cancer in 185 countries: evidence
from GLOBOCAN 2018. Breast Cancer Research and Treatment, 187, pp.557-567

Sharma, S., Aggarwal, A., & Choudhury, T. (2018, December). Breast cancer detection using
machine learning algorithms. In 2018 International conference on computational techniques,
electronics and mechanical systems (CTEMS) (pp. 114-118). IEEE

Shen, L., Margolies, L.R., Rothstein, J.H., Fluder, E., McBride, R. and Sieh, W., 2019. Deep
learning to improve breast cancer detection on screening mammography. Scientific
reports, 9(1), p.12495

Siegel, R.L., Miller, K.D., Fuchs, H.E. and Jemal, A., 2022. Cancer statistics, 2022. CA: a
cancer journal for clinicians, 72(1), pp.7-33

Subashini, T.S., Ramalingam, V. and Palanivel, S., 2009. Breast mass classification based on
cytological patterns using RBFNN and SVM. Expert Systems with Applications, 36(3),
pp.5284-5290

Sung, H., Ferlay, J., Siegel, R.L., Laversanne, M., Soerjomataram, I., Jemal, A. and Bray, F.,
2021. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality
worldwide for 36 cancers in 185 countries. CA: a cancer journal for clinicians, 71(3), pp.209-
249

Sutton, R.S. and Barto, A.G., 2018. Reinforcement learning: An introduction. MIT press.

Vaka, A.R., Soni, B. and Reddy, S., 2020. Breast cancer detection by leveraging Machine
Learning. Ict Express, 6(4), pp.320-324

Vasundhara, S., Kiranmayee, B.V. and Suresh, C., 2019. Machine learning approach for breast
cancer prediction. International Journal of Recent Technology and Engineering (IJRTE), 8(1).

Wang, Z., Sun, H., Li, J., Chen, J., Meng, F., Li, H., Han, L., Zhou, S. and Yu, T., 2022.
Preoperative prediction of axillary lymph node metastasis in breast cancer using CNN based
on multiparametric MRI. Journal of Magnetic Resonance Imaging, 56(3), pp.700-709

Wolberg, W. H., & Mangasarian, O. L. (1990). Multisurface method of pattern separation for
medical diagnosis applied to breast cytology. Proceedings of the National Academy of
Sciences, 87(23), 9193-9196

Zhou, Z.H., 2018. A brief introduction to weakly supervised learning. National science
review, 5(1), pp.44-53
