Applications of Machine Learning Techniques To Predict Diagnostic Breast Cancer
https://doi.org/10.1007/s42979-020-00296-8
ORIGINAL RESEARCH
Abstract
This article compares six machine learning (ML) algorithms: Classification and Regression Tree (CART), Support Vector Machine (SVM), Naïve Bayes (NB), K-Nearest Neighbors (KNN), Linear Regression (LR) and Multilayer Perceptron (MLP) on the Wisconsin Diagnostic Breast Cancer (WDBC) dataset by estimating their classification test accuracy, standardized data accuracy and runtime. The main objective of this study is to improve the accuracy of prediction using a new statistical method of feature selection. The data set has 32 features, which are reduced using a statistical technique (mode), and the same measurements as above are applied for comparative studies. On the reduced attribute data subset (12 features), we applied six ensemble models: AdaBoost (AB), Gradient Boosting Classifier (GBC), Random Forest (RF), Extra Trees (ET), Bagging and Extreme Gradient Boosting (XGB), to minimize the probability of misclassification based on any single induced model. We also apply a stacking classifier (voting classifier) to the basic learners: Logistic Regression (LR), Decision Tree (DT), Support Vector Classifier (SVC), K-Nearest Neighbors (KNN), Random Forest (RF) and Naïve Bayes (NB) to find the accuracy obtained at the voting (meta) level. To implement the ML algorithms, the data set is divided as follows: 80% is used in the training phase and 20% in the test phase. The classifiers are tuned with manually assigned hyper-parameters. All ML algorithms perform well at the different stages of classification, with test accuracy exceeding 90%, especially when applied to the data subset.
Keywords Classification · Linear regression · Machine learning · Multilayer perceptron · k-Nearest neighbors · Support
vector machine · Ensemble · Stack
270 Page 2 of 11 SN Computer Science (2020) 1:270
out the relationship between prevention or treatment and the patient's prognosis [3].

In this study, the main goal is to obtain the accuracy of the data set after its features have been reduced with the statistical method mode. Finally, on the reduced feature data subset, we apply ensemble techniques to combine multiple models constructed from a single learning algorithm by systematically changing the training data.

The rest of this article is organized as follows: (2) Literature review, covering previous studies by different researchers based on many basic learners and their combination techniques (ensemble and stacking methods that produce a single result); (3) the suggested technical details of the model, including data investigation and preprocessing, the statistical technique mode and the overall framework; (4) the proposed model on the data set and the data subset, with a balanced comparative analysis of the methods and the overall structure; and, at the end, (5) the conclusion and (6) a discussion of future work.

Literature Review

In this part of the article, we introduce previous research related to breast cancer detection and the different types of classifiers that have been used to find accuracy. Table 1 shows a summary of the literature review.

Methodology

For this study, the dataset used is the "Wisconsin Breast Cancer (Diagnostic) Data Set" with 569 instances and 32 attributes. This data set was created by Dr. William H. Wolberg of the University of Wisconsin to diagnose breast cancer, i.e., (M = malignant, B = benign). The dataset is located at archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29.

Data Explanation

The breast cancer clinical data set contains 569 cases (357 benign, 212 malignant) reported on November 1, 1995, with the patient ID number and diagnosis (malignant/benign) of each case. The remaining attributes contain 10 real-valued features for each cell nucleus. All information about the attributes is discussed in detail in Table 2. Figure 1 shows the number of patients.

In the diagnostic data set for breast cancer, the attribute "diagnosis" is replaced by 0 for B and 1 for M. When the units of measurement in a data set differ, we need to standardize the data. Standardization is the process of rescaling one or more attributes so that their mean is 0 and the standard deviation is 1. Without standardization, variables measured at different scales will not contribute equally to the analysis and may produce deviations. Clinical data collected from different organizations for different purposes may be recorded in different formats. For these records to have the same format, they must be standardized [20]. To obtain a standardized value (z score) of the remaining attributes of the dataset, we use the following formula:

z = (X − μ) / σ,

where X is the observation, μ the mean and σ the standard deviation.

Modeling

Figure 2 shows the flow chart of model formation. The whole process can be divided into four parts.

1. The diagnostic breast cancer data set has 31 attributes, excluding the patient's ID number. Feature extraction techniques are used to extract relevant features with high scores from the dataset.
2. The feature selection technique mode is applied, which keeps only the prominent features, i.e., those with precise meaning, among the many attributes.
3. Different accuracy measures from different classifiers are then applied to the reduced subset of data.
4. For comparison, all the above classifiers are also applied to the data set with all features, and the performance conclusions are contrasted with those from the reduced data subset.

Feature Extraction Techniques

We often pay attention to the features that contribute the most to the predictors or outputs. The process of selecting such variables is called a feature selection method [21]. The existence of unrelated attributes in the data set may affect its accuracy. Before data modeling, feature extraction may therefore be helpful: it can improve accuracy, reduce over-fitting, and reduce training time. The following are the feature extraction techniques used in this research paper.

The chi2 test is often used in hypothesis testing. The chi2 statistic measures the discrepancy between observed and expected frequencies.
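The z-score standardization described above can be sketched in a few lines. This is a minimal illustration using plain NumPy; the toy numbers are invented to show two attributes on very different scales (the paper's own pipeline may use a library scaler instead):

```python
import numpy as np

def standardize(X):
    """Rescale each column to mean 0 and standard deviation 1: z = (X - mu) / sigma."""
    mu = X.mean(axis=0)       # per-attribute mean
    sigma = X.std(axis=0)     # per-attribute standard deviation
    return (X - mu) / sigma

# Toy data: 4 samples, 2 attributes measured on very different scales.
X = np.array([[1000.0, 0.1],
              [1200.0, 0.2],
              [ 800.0, 0.3],
              [1000.0, 0.4]])
Z = standardize(X)
print(Z.mean(axis=0))  # approximately [0, 0]
print(Z.std(axis=0))   # approximately [1, 1]
```

After this rescaling, both attributes contribute on equal footing regardless of their original units.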
Table 1 Summary of the literature review

Author [ref] | Year | Method | Application | Accuracy
Elsayad [4] | 2010 | Ensemble of Bayesian classifiers (multilayer perceptron neural network) | Severity of breast masses | 91.83% on training subset, 90.63% on test
Huang et al. [5] | 2010 | Neural network classifier | Breast cancer classification | 98.83%
Lavanya and Rani [6] | 2011 | Decision tree algorithm | Breast cancer detection | 92.97%
Bekaddour and Chikh [7] | 2012 | ANFIS (Adaptive Neuro-Fuzzy Inference System) | Breast cancer diagnosis | 98.25%
Al-Bahrani et al. [8] | 2013 | Ensemble voting scheme | Prediction model for colon cancer | 90.38%, 88.01%, and 85.13%
Zheng et al. [9] | 2014 | K-means and support vector machine (K-SVM) | Tumor detection | 97.38%
Vikas et al. [10] | 2014 | Naive Bayes, Support Vector Machine with Radial Basis Function (SVM-RBF) kernel, RBF neural networks, and decision trees | Breast cancer | SVM-RBF 96.84%
Zhang et al. [11] | 2015 | Ensemble decision approach (recursive partitioning tree) | Breast cancer (four molecular subtypes: Luminal-A, Luminal-B, HER2-amplified and Triple-negative) | 83.8%, 77.4%, 87.9% and 92.7%
Hazra et al. [12] | 2016 | Naïve Bayes, Support Vector Machine, ensemble classifier | Breast cancer classification | 97.3978% each
Nilashi et al. [13] | 2017 | Expectation Maximization (EM) and Classification and Regression Trees (CART) to generate fuzzy rules | Breast cancer | 93.20%
Chaurasia et al. [14] | 2018 | Naive Bayes, RBF network, J48 | Breast cancer prediction | 97.36%, 96.77%, and 93.94%, respectively
Emami and Pakzad [15] | 2018 | Affinity Propagation (AP) clustering for instance reduction, Adaptive Modified Binary Firefly Algorithm (AMBFA) for predictor selection, and Support Vector Machine (SVM) for prediction | Breast cancer diagnosis | 98.606%
Kadam et al. [16] | 2019 | Feature ensemble learning based on sparse autoencoders and softmax regression | Breast cancer (benign vs. malignant prediction) | 98.60%
Saritas and Yasar [17] | 2019 | Artificial neural networks and Naïve Bayes classifiers | Estimation of having breast cancer | 86.95% and 83.54%, respectively
Rahman and Muniyandi [18] | 2020 | 15-neuron network | Diagnostic breast cancer | 99.4%
Table 2 Attribute information [19]

(1) ID number
(2) Diagnosis (M = malignant, B = benign)
(3–32) Ten real-valued features are computed for each cell nucleus:
(a) radius (mean of distances from center to points on the perimeter)
(b) texture (standard deviation of gray-scale values)
(c) perimeter
(d) area
(e) smoothness (local variation in radius lengths)
(f) compactness (perimeter^2/area − 1.0)
(g) concavity (severity of concave portions of the contour)
(h) concave points (number of concave portions of the contour)
(i) symmetry
(j) fractal dimension ("coastline approximation" − 1)

• Recursive feature elimination (RFE)

Recursive feature elimination repeatedly rebuilds the model and recalculates the importance scores. First, the algorithm fits the model to all predictor variables [24]. Each predictor is then ranked by its importance to the model. Let S be an ordered sequence of numbers that are candidates for the number of predictors to keep (S1 > S2, …). In each feature selection iteration, the Si top-ranked predictors are retained, the model is refit and its performance is evaluated. The value of Si with the best performance is determined, and the top Si predictors are used in the final model.

• Random forest (RF)

Random forest is a supervised learning algorithm that can also be used for regression. However, it is mainly used for classification problems. A forest is composed of trees, and more trees mean a more robust forest. Similarly, the random forest algorithm creates decision trees on data samples, obtains a prediction from each, and finally selects the best solution by voting [25]. It is a better ensemble technique than a single decision tree because it reduces over-fitting by averaging the results.
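The four first-level selectors used in this study (univariate chi2, extra trees, RFE and random forest) can be sketched as follows. The paper states only that Python 3.6 was used; the use of scikit-learn and its bundled copy of the WDBC data is an assumption for illustration, and k=5 is used here instead of the paper's 15 just to keep the printout short:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

data = load_breast_cancer()
X, y, names = data.data, data.target, data.feature_names
k = 5

# 1. Univariate chi2 test: scores each attribute independently of the others.
chi2_idx = SelectKBest(chi2, k=k).fit(X, y).get_support(indices=True)

# 2. Extra trees: impurity-based feature importances.
et = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
et_idx = et.feature_importances_.argsort()[::-1][:k]

# 3. Recursive feature elimination around a base estimator.
rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=k).fit(X, y)
rfe_idx = rfe.get_support(indices=True)

# 4. Random forest importances.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
rf_idx = rf.feature_importances_.argsort()[::-1][:k]

for label, idx in [("chi2", chi2_idx), ("extra trees", et_idx),
                   ("RFE", rfe_idx), ("random forest", rf_idx)]:
    print(label, [names[i] for i in idx])
```

Each selector returns its own top-k list; the study then pools these lists and ranks attributes by how many selectors agree on them.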
Column 4: Starting from the first frequency of col. 1, the frequencies are grouped in threes and the highest total is marked.
Column 5: Starting from the second frequency of col. 1 (leaving the first), the frequencies are grouped in threes and the highest total is marked.
Column 6: Starting from the third frequency of col. 1 (leaving the first two), the frequencies are grouped in threes and the highest total is marked.

After completing the grouping table, an analysis table is formed to find out which attributes appear the highest number of times. We tick (√) the values that take part in the maximum of each column.

Accuracy of Classifiers

We have obtained a simplified data set from the above statistical method. To verify that the reduced data set retains sufficient information about the patient category (benign/malignant), we can apply different accuracy measures, such as basic learners, the accuracy of standardized and tuned data sets, the ROC curve, ensemble methods and stacking [27].

Experiment

This section covers the experimental setup and methodology, as well as the results of this model on the diagnostic breast cancer dataset. The division into training and test sets follows the ratio 80:20 and is chosen arbitrarily. The training sets from the two datasets (reduced by feature selection, and containing all features) are processed and evaluated with different accuracy measures for a comparative study. All the analysis was performed using Python 3.6.

Feature Extraction

Two levels of feature extraction methods are applied to the diagnostic breast cancer data set. First, we use the univariate (χ2) test, extra trees, recursive feature elimination, and random forest to select the best features from the data set [28]. By selecting 15 features from each method, we obtain a total of 60 features. All these features are shown in Table 3.

Mapping Features

These features need to be mapped to abbreviations to determine the rank of each feature for analysis. After mapping the features from Table 3 into Table 4, the first column shows the attribute name corresponding to the abbreviated form in the second column. Columns 3–6 then show the attributes that are repeated across the different feature selection techniques. Finally, column 7 gives the rank obtained by each feature. After assigning feature ranks, these 60 features are reduced to 18 features.
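The 80:20 division and the standardized-accuracy measurement described above can be sketched as follows. This is a minimal illustration assuming scikit-learn (not named in the paper) and its bundled WDBC data, with logistic regression standing in for the full set of basic learners:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# 80:20 train/test division, chosen arbitrarily (fixed seed for repeatability).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Standardize inside a pipeline so the scaler is fit on the training data only,
# avoiding leakage from the test fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
print(f"test accuracy: {acc:.4f}")
```

The exact accuracy depends on the random split and hyper-parameters; the point of the sketch is the split ratio and the scaler-inside-pipeline pattern.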
Table 4 Abbreviation and rank distribution of attributes

Attribute name Abbreviated as Univariate Extra Tree RFE RF Rank
area_worst f1 f1 f1 f1 f1 4
area_mean f2 f2 f2 f2 f2 4
area_se f3 f3 f3 f3 f3 4
perimeter_worst f4 f4 f4 f4 f4 4
perimeter_mean f5 f5 f5 f5 f5 4
radius_worst f6 f6 – f6 f6 3
radius_mean f7 f7 f7 f7 f7 4
perimeter_se f8 f8 – – f8 2
texture_worst f9 f9 f9 f9 f9 4
texture_mean f10 f10 – – f10 2
concavity_worst f11 f11 f11 f11 f11 4
radius_se f12 f12 f12 f12 – 3
concavity_mean f13 f13 f13 f13 f13 4
compactness_worst f14 f14 f14 f14 f14 4
concave points_worst f15 f15 f15 f15 f15 4
concave points_mean f16 – f16 f16 f16 3
compactness_mean f17 – f17 f17 – 2
smoothness_mean f18 – f22 – – 1
Grouping table: for each attribute, the first entry is its frequency (the rank from Table 4); the remaining entries are the marked pair and triple totals obtained by the grouping procedure described earlier.

f1 4 8 12
f2 4 8 12
f3 4 8 12
f4 4 8 11
f5 4 7 11
f6 3 7 9
f7 4 6 10
f8 2 6 8
f9 4 6 10
f10 2 6 9
f11 4 7 11
f12 3 7 11
f13 4 8 12
f14 4 8 11
f15 4 7 9
f16 3 5 6
f17 2 3
f18 1
Analysis table: each row I–VI ticks (√) the attributes (f1–f18, left to right) that take part in the marked maximum of the corresponding grouping column; the last row counts the occurrences per attribute.

I √ √ √ √ √ √ √ √ √ √ √
II √ √ √ √ √ √
III √ √ √ √ √ √
IV √ √ √ √ √ √
V √ √ √
VI √ √ √
No. of occurrences 3 5 6 5 3 – 1 – 1 – 1 – 3 4 3 – – –
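The rank-and-occurrence bookkeeping behind these tables amounts to counting how many selectors picked each attribute and keeping the attributes most of them agree on. A minimal sketch (the feature sets below are hypothetical stand-ins, not the paper's Table 3 contents):

```python
from collections import Counter

# Illustrative output of four selectors; each entry is the set of attribute
# names one technique retained (hypothetical values for demonstration).
selections = {
    "univariate":    {"area_worst", "area_mean", "radius_mean", "texture_worst"},
    "extra_trees":   {"area_worst", "area_mean", "concavity_mean"},
    "rfe":           {"area_worst", "radius_mean", "concavity_mean"},
    "random_forest": {"area_worst", "area_mean", "radius_mean"},
}

# Rank of a feature = number of techniques that selected it (column 7 of Table 4).
rank = Counter()
for chosen in selections.values():
    rank.update(chosen)

# Keep the attributes that the majority of selectors agree on.
kept = sorted(f for f, r in rank.items() if r >= 3)
print(rank["area_worst"])  # 4: selected by every technique
print(kept)
```

The paper's mode-based grouping then refines this count-based ranking further, but the core reduction from 60 pooled selections down to a small agreed subset follows this counting pattern.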
Table 7 Abbreviated table of attributes

f1 Area_worst
f2 Area_mean
f3 Area_se
f4 Perimeter_worst
f5 Perimeter_mean
f7 Radius_mean
f9 Texture_worst
f11 Concavity_worst
f13 Concavity_mean
f14 Compactness_worst
f15 Concave points_worst

The ROC curve plots the "true positive rate" (showing the level of correct classification in the positive category) on the y axis. The area under the ROC polyline (AuROC) indicates how much more often the classifier assigns a higher prediction probability to a true positive than to a true negative. Since the performance of every classifier is acceptable, it is difficult to distinguish the individual ROC curves in the graph. Each of the two figures therefore shows the comparative accuracy to enhance the visualization. Of the two ROC curves in Table 8, LR has the highest accuracy, i.e., 99%.

Ensemble Techniques
Table 8 Different accuracy metrics

Metrics | Dataset with 31 attributes | Data subset with 12 attributes
Accuracy of basic classifiers | CART: 0.912077 (run time: 0.268554) | CART: 0.925362 (run time: 0.501953)
 | SVM: 0.619614 (run time: 0.520508) | SVM: 0.619614 (run time: 0.417969)
 | NB: 0.940773 (run time: 0.040039) | NB: 0.938647 (run time: 0.031250)
 | KNN: 0.927729 (run time: 0.151367) | KNN: 0.925507 (run time: 0.106445)
 | LR: 0.949614 (run time: 0.178711) | LR: 0.951836 (run time: 1.174805)
 | MLP: 0.788068 (run time: 1.350586) | MLP: 0.740386 (run time: 0.810547)
Standardized data accuracy | ScaledCART: 0.914396 (run time: 0.854492) | ScaledCART: 0.923092 (run time: 0.049805)
 | ScaledSVM: 0.964879 (run time: 0.078125) | ScaledSVM: 0.958261 (run time: 0.049805)
 | ScaledNB: 0.931932 (run time: 0.031250) | ScaledNB: 0.942995 (run time: 0.031250)
 | ScaledKNN: 0.958357 (run time: 0.076172) | ScaledKNN: 0.949565 (run time: 0.031250)
 | ScaledLR: 0.969324 (run time: 0.118164) | ScaledLR: 0.964928 (run time: 0.050781)
 | ScaledMLP: 0.967101 (run time: 7.107422) | ScaledMLP: 0.958309 (run time: 8.562500)
Tuned accuracy | (LR) 0.977333 (run time: 0.013672) | (LR) 0.974912 (run time: 0.832031)
ROC curve
Accuracy of ensemble models (%)

AB 94.7343
GBC 93.8599
RF 94.7295
ET 95.1739
Bagging 94.5169
XGBoost 95.1691
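The ensemble comparison above and the voting combination of heterogeneous base learners can be sketched as follows. This assumes scikit-learn (not named in the paper) and its bundled WDBC data; XGBoost is omitted because it lives in the separate xgboost package, and the exact scores will differ from the paper's:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Ensemble models analogous to AB, GBC, RF, ET and Bagging above.
ensembles = {
    "AB": AdaBoostClassifier(random_state=0),
    "GBC": GradientBoostingClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "ET": ExtraTreesClassifier(random_state=0),
    "Bagging": BaggingClassifier(random_state=0),
}
scores = {}
for name, clf in ensembles.items():
    scores[name] = clf.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name}: {scores[name]:.4f}")

# Hard-voting combination of heterogeneous base learners (the meta level).
voter = VotingClassifier([
    ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))),
    ("dt", DecisionTreeClassifier(random_state=0)),
    ("rf", RandomForestClassifier(random_state=0)),
])
vote_acc = voter.fit(X_train, y_train).score(X_test, y_test)
print(f"voting: {vote_acc:.4f}")
```

Each ensemble averages or boosts many trees trained on perturbed samples of the training data, while the voting classifier resolves disagreements among dissimilar base learners by majority vote.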
to use a stacked model with AdaBoost, random forest, extra trees, logistic regression, and decision trees on the 0th layer, and a voting classifier on the meta layer. We then predict the target variable on the test dataset and check the accuracy of this stacked model based on those predictions. We obtain 92.9824% accuracy from this model.

Conclusion

Malignant growth is the disease with the second highest diagnostic frequency. The classification results show great significance, especially for the identification of malignant cases. This study proposes a feature selection method (mode) and uses basic classifiers and an ensemble model with stacking classifiers to classify the instances with all attributes in comparison to the reduced data subset. Each instance is classified as benign or malignant, and an overall accuracy of 99% is achieved by the basic classifier. On the WBCD data set, accuracy is 95.1739% with the ensemble model and 92.9824% with the stacked classifier. By comparing the full data set and the data subset, the basic classifier is validated against the stacked and ensemble models in terms of accuracy, accuracy on standardized data, tuned accuracy and AuROC.

Unnecessary attributes need not appear in the data set: they may affect its accuracy, produce over-fitting and consume prediction time. Following these ideas, the legitimacy and clinical value of the ensemble model and stacked model proposed in this study were confirmed.

Discussion

The main idea used in this study is a statistical technique for feature selection that eliminates redundant attributes from the data set. The approach can be applied to comparable problems, such as diabetes typing, cervical cancer survival rates and the identification of tumor cells, and to quite different areas, such as sentiment analysis, drug classification, facial recognition, pedestrian identification in autonomous driving, credit scoring, or spam detection, wherever some attributes of the data set are irrelevant or only weakly relevant. In addition, combining the important classifiers with stacking, ensembling and mode allows modularity of the entire model. After basic preprocessing, any data set with reduced features and a binary classification task can directly use this procedure. Nevertheless, the model still has some shortcomings. Clinical data are rarely dedicated to classification, containing many missing values and anomalies that may affect classification performance. When managing high-dimensional data sets, precision and specificity, the confusion matrix and other indicators should be
considered. These problems make the proposed model not directly applicable in the clinic. Similarly, the choice of feature selection method and the decision on the type and number of pattern classifiers may additionally affect the performance, as well as the time efficiency. Future work may include a system to check whether the standard classifier is indeed ideal and, if necessary, to try to build a better one. With higher dimensions and more examples, deep learning strategies may also help to achieve better classification performance.

Compliance with Ethical Standards

Conflict of Interest The authors declare no conflict of interest.

References

1. https://www.nationalbreastcancer.org/about-breast-cancer/, 2019.
2. Luca M, Kleinberg J, Mullainathan S. Algorithms need managers, too. Brighton: Chapman & Hall Ltd; 2016.
3. Coiera E. Guide to medical informatics, the Internet and telemedicine. London: Chapman & Hall Ltd; 1997.
4. Elsayad AM. Predicting the severity of breast masses with ensemble of Bayesian classifiers. J Comput Sci. 2010;6(5):576–84.
5. Huang M, Hung Y, Chen W. Neural network classifier with entropy based feature selection on breast cancer diagnosis. J Med Syst. 2010;34:865–73. https://doi.org/10.1007/s10916-009-9301-x.
6. Lavanya D, Rani DK. Analysis of feature selection with classification: breast cancer datasets. Indian J Comput Sci Eng (IJCSE). 2011;2(5):756–63.
7. Bekaddour F. A neuro-fuzzy inference model for breast cancer recognition. Int J Comput Sci Inf Technol. 2012;4(5):163–73.
8. Al-Bahrani R, Agrawal A, Choudhary A. Colon cancer survival prediction using ensemble mining on SEER data. In: Proceedings of the IEEE International Conference on Big Data, 2013, pp 9–16.
9. Zheng B, Yoon SW, Lam SS. Breast cancer diagnosis based on feature extraction using a hybrid of K-means and support vector machine algorithms. Expert Syst Appl. 2014;41(4):1476–82.
10. Chaurasia V, Pal S. Data mining techniques: to predict and resolve breast cancer survivability. IJCSMC. 2014;3:10–22.
11. Zhang L, Li J, Xiao Y, et al. Identifying ultrasound and clinical features of breast cancer molecular subtypes by ensemble decision. Sci Rep. 2015;5:11085. https://doi.org/10.1038/srep11085.
12. Hazra A, Mandal S, Gupta A. Study and analysis of breast cancer cell detection using Naïve Bayes, SVM and ensemble algorithms. Int J Comput Appl. 2016;145(2):0975–8887.
13. Nilashi M, Ibrahim O, Ahmadi H, Shahmoradi L. A knowledge-based system for breast cancer classification using fuzzy logic method. Telemat Inf. 2017;34(4):133–44.
14. Chaurasia V, Pal S, Tiwari BB. Prediction of benign and malignant breast cancer using data mining techniques. J Algorithms Comput Technol. 2018;12(2):119–26.
15. Emami N, Pakzad A. A new knowledge-based system for diagnosis of breast cancer by a combination of affinity propagation clustering and firefly algorithm. J AI Data Min. 2018;7:59–68.
16. Kadam VJ, Jadhav SM, Vijayakumar K. Breast cancer diagnosis using feature ensemble learning based on stacked sparse autoencoders and softmax regression. J Med Syst. 2019;43:263. https://doi.org/10.1007/s10916-019-1397-z.
17. Saritas M, Yasar A. Performance analysis of ANN and Naive Bayes classification algorithm for data classification. IJISAE. 2019;7(2):88–91.
18. Rahman MA, Muniyandi RC. An enhancement in cancer classification accuracy using a two-step feature selection method based on artificial neural networks with 15 neurons. Symmetry. 2020;12:271.
19. Dua D, Graff C. UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science; 2019.
20. Batyrshin I. Constructing time series shape association measures: Minkowski distance and data standardization. In: BRICS CCI 2013, Brasil, Porto de Galhinas. 2013. http://arxiv.org/pdf/1311.1958v3.
21. Kavitha R, Kannan E. An efficient framework for heart disease classification using feature extraction and feature selection technique in data mining. In: IEEE International Conference on Emerging Trends in Engineering Technology and Science (ICETETS), 2016, pp 1–5.
22. Uysal AK, Gunal S, Ergin S. The impact of feature extraction and selection on SMS spam filtering. Electronics and Electrical Engineering. 2013;19(5):67–72.
23. Maier O, Wilms M, von der Gablentz J, Krämer UM, Münte TF, Handels H. Extra Tree forests for sub-acute ischemic stroke lesion segmentation in MR sequences. J Neurosci Methods. 2015;240:89–100.
24. Li L, Cui X, Yu S, Zhang Y, Luo Z, Yang H, Zhou Y, Zheng X. PSSP-RFE: accurate prediction of protein structural class by recursive feature extraction from PSI-BLAST profile, physical-chemical property and functional annotations. PLoS One. 2014;9:e92863.
25. Scanlon P, Kennedy IO, Liu Y. Feature extraction approaches to RF fingerprinting for device identification in femtocells. Bell Labs Tech J. 2010;15(3):141–51.
26. Kwac K, Lee H, Cho M. Non-Gaussian statistics of amide I mode frequency fluctuation of N-methylacetamide in methanol solution: linear and nonlinear vibrational spectra. J Chem Phys. 2004;120:1477–90.
27. Labatut V, Cherifi H. Accuracy measures for the comparison of classifiers. 2012. http://arxiv.org/abs/1207.3790.
28. Guyon I, Gunn S, Nikravesh M, Zadeh L, editors. Feature extraction, foundations and applications. New York: Springer; 2006.
29. Araque O, Corcuera-Platas I, Sanchez-Rada JF, Iglesias CA. Enhancing deep learning sentiment analysis with ensemble techniques in social applications. Expert Syst Appl. 2017;77:236–46.
30. Malmasi S, Dras M. Native language identification with classifier stacking and ensembles. Comput Linguist. 2018;44(3):403–46. https://doi.org/10.1162/coli_a_00323.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.