Applications of Machine Learning Techniques To Predict Diagnostic Breast Cancer
https://doi.org/10.1007/s42979-020-00296-8
ORIGINAL RESEARCH
Abstract
This article compares six machine learning (ML) algorithms: Classification and Regression Tree (CART), Support Vector Machine (SVM), Naïve Bayes (NB), K-Nearest Neighbors (KNN), Linear Regression (LR) and Multilayer Perceptron (MLP) on the Wisconsin Diagnostic Breast Cancer (WDBC) dataset by estimating their classification test accuracy, standardized data accuracy and runtime. The main objective of this study is to improve the accuracy of prediction using a new statistical method of feature selection. The data set has 32 features, which are reduced using a statistical technique (mode), and the same measurements as above are applied for comparative studies. On the reduced attribute data subset (12 features), we applied six ensemble models: AdaBoost (AB), Gradient Boosting Classifier (GBC), Random Forest (RF), Extra Trees (ET), Bagging and Extreme Gradient Boosting (XGB), to minimize the probability of misclassification based on any single induced model. We also apply a stacking classifier (voting classifier) to the basic learners: Logistic Regression (LR), Decision Tree (DT), Support Vector Classifier (SVC), K-Nearest Neighbors (KNN), Random Forest (RF) and Naïve Bayes (NB) to find the accuracy obtained at the voting (meta) level. To implement the ML algorithms, the data set is divided as follows: 80% is used in the training phase and 20% in the test phase. The classifiers are tuned with manually assigned hyper-parameters. All ML algorithms perform well at the different stages of classification, with test accuracy exceeding 90%, especially when applied to the data subset.
Keywords Classification · Linear regression · Machine learning · Multilayer perceptron · k-Nearest neighbors · Support
vector machine · Ensemble · Stack
270 Page 2 of 11 SN Computer Science (2020) 1:270
out the relationship between prevention or treatment and the patient's prognosis [3].

In this study, the main goal is to obtain the accuracy of the data set after its features have been reduced with the statistical method mode. Finally, on the reduced feature data subset, we apply ensemble techniques to combine multiple models constructed from a single learning algorithm by systematically changing the training data.

The rest of this article is organized as follows: (2) Literature review, covering previous studies by different researchers based on many basic learners and their combination techniques (ensemble and stacking methods that produce a single result); (3) the suggested technical details of the model, including data investigation and preprocessing, the statistical technique mode and the overall framework; (4) the proposed model on the data set and the data subset, with a balanced comparative analysis of the methods and the overall structure; and, at the end, (5) the conclusion and (6) a discussion of future work.

Literature Review

In this part of the article, we introduce previous research related to breast cancer detection and the different types of classifiers that have been used to find accuracy. Table 1 shows a summary of the literature review.

Methodology

For this study, the dataset used is the "Wisconsin Breast Cancer (Diagnostic) Data Set" with 569 instances and 32 attributes. This data set was created by Dr. William H. Wolberg of the University of Wisconsin to diagnose breast cancer, i.e., (M = malignant, B = benign). The dataset is located at archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29.

Data Explanation

The breast cancer clinical data set contains 569 cases (357 benign, 212 malignant) reported on November 1, 1995, with the patient ID number and diagnosis (malignant/benign) of each case. The remaining attributes contain 10 real-valued features for each cell nucleus. All information about the attributes is discussed in detail in Table 2. Figure 1 shows the number of patients.

In the diagnostic data set for breast cancer, the attribute "diagnosis" is replaced by 0 for B and 1 for M. When the units of measurement in a data set differ, we need to standardize the data. Standardization is the process of rescaling one or more attributes so that their mean is 0 and the standard deviation is 1. Without standardization, variables measured at different scales will not contribute equally to the analysis and may produce deviations. Clinical data collected from different organizations for different purposes may be recorded in different formats. For these records to have the same format, they must be standardized [20]. To obtain a standardized value (z score) of the remaining attributes of the dataset, we use the following formula:

z = (X − μ) / σ,

where X is the observation, μ the mean and σ the standard deviation.

Modeling

Figure 2 shows the flow chart of model formation. The whole process can be divided into four parts.

1. The diagnostic breast cancer data set has 31 attributes, excluding the patient's ID number. Feature extraction techniques are used to extract relevant features with high scores from the dataset.
2. The feature selection technique mode is applied, which keeps only the prominent features, i.e., those with precise meaning, among the many attributes.
3. Different accuracy measures from different classifiers are then applied to the reduced subset of data.
4. For comparison, all the above classifiers are also applied to the data set with all features, and the performance conclusions are contrasted with those from the reduced data subset.

Feature Extraction Techniques

We often pay attention to the features that contribute the most to the predictors or outputs. The process of selecting such variables is called a feature selection method [21]. The existence of unrelated attributes in the data set may affect its accuracy. Before data modeling, feature extraction may therefore be helpful: it can improve accuracy, reduce over-fitting, and reduce training time. The following are the feature extraction techniques used in this research paper.

The chi2 test is often used in hypothesis testing. The chi2 statistic measures the discrepancy between observed and expected frequencies.
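The z-score standardization described above can be sketched in a few lines. This is a minimal illustration using plain NumPy; the toy numbers are invented to show two attributes on very different scales (the paper's own pipeline may use a library scaler instead):

```python
import numpy as np

def standardize(X):
    """Rescale each column to mean 0 and standard deviation 1: z = (X - mu) / sigma."""
    mu = X.mean(axis=0)       # per-attribute mean
    sigma = X.std(axis=0)     # per-attribute standard deviation
    return (X - mu) / sigma

# Toy data: 4 samples, 2 attributes measured on very different scales.
X = np.array([[1000.0, 0.1],
              [1200.0, 0.2],
              [ 800.0, 0.3],
              [1000.0, 0.4]])
Z = standardize(X)
print(Z.mean(axis=0))  # approximately [0, 0]
print(Z.std(axis=0))   # approximately [1, 1]
```

After this rescaling, both attributes contribute on equal footing regardless of their original units.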
Table 1 Summary of the literature review

Author [ref] | Year | Method | Application | Accuracy
Elsayad [4] | 2010 | Ensemble of Bayesian classifiers (multilayer perceptron neural network) | Severity of breast masses | 91.83% on training subset, 90.63% on test
Huang et al. [5] | 2010 | Neural network classifier | Breast cancer classification | 98.83%
Lavanya and Rani [6] | 2011 | Decision tree algorithm | Breast cancer detection | 92.97%
Bekaddour and Chikh [7] | 2012 | ANFIS (Adaptive Neuro-Fuzzy Inference System) | Breast cancer diagnosis | 98.25%
Al-Bahrani et al. [8] | 2013 | Ensemble voting scheme | Prediction model for colon cancer | 90.38%, 88.01%, and 85.13%
Zheng et al. [9] | 2014 | K-means and support vector machine (K-SVM) | Tumor detection | 97.38%
Vikas et al. [10] | 2014 | Naive Bayes, Support Vector Machine with Radial Basis Function (SVM-RBF) kernel, RBF neural networks, and decision trees | Breast cancer | SVM-RBF 96.84%
Zhang et al. [11] | 2015 | Ensemble decision approach (recursive partitioning tree) | Breast cancer (four molecular subtypes: Luminal-A, Luminal-B, HER2-amplified and Triple-negative) | 83.8%, 77.4%, 87.9% and 92.7%
Hazra et al. [12] | 2016 | Naïve Bayes, Support Vector Machine, ensemble classifier | Breast cancer classification | 97.3978% each
Nilashi et al. [13] | 2017 | Expectation Maximization (EM) and Classification and Regression Trees (CART) to generate fuzzy rules | Breast cancer | 93.20%
Chaurasia et al. [14] | 2018 | Naive Bayes, RBF network, J48 | Breast cancer prediction | 97.36%, 96.77%, and 93.94%, respectively
Emami and Pakzad [15] | 2018 | Affinity Propagation (AP) clustering for instance reduction, Adaptive Modified Binary Firefly Algorithm (AMBFA) for predictor selection, and Support Vector Machine (SVM) for prediction | Breast cancer diagnosis | 98.606%
Kadam et al. [16] | 2019 | Feature ensemble learning based on sparse autoencoders and softmax regression | Breast cancer (benign vs. malignant prediction) | 98.60%
Saritas and Yasar [17] | 2019 | Artificial neural networks and Naïve Bayes classifiers | Estimation of having breast cancer | 86.95% and 83.54%, respectively
Rahman and Muniyandi [18] | 2020 | 15-neuron network | Diagnostic breast cancer | 99.4%
Table 2 Attribute information [19]

(1) ID number
(2) Diagnosis (M = malignant, B = benign)
(3–32) Ten real-valued features are computed for each cell nucleus:
(a) radius (mean of distances from center to points on the perimeter)
(b) texture (standard deviation of gray-scale values)
(c) perimeter
(d) area
(e) smoothness (local variation in radius lengths)
(f) compactness (perimeter^2/area − 1.0)
(g) concavity (severity of concave portions of the contour)
(h) concave points (number of concave portions of the contour)
(i) symmetry
(j) fractal dimension ("coastline approximation" − 1)

• Recursive feature elimination (RFE)

Recursive feature elimination repeatedly rebuilds the model and recalculates the importance scores. First, the algorithm fits the model to all predictor variables [24]. Each predictor is then ranked by its importance to the model. Let S be an ordered sequence of numbers that are candidates for the number of predictors to keep (S1 > S2, …). In each feature selection iteration, the Si top-ranked predictors are retained, the model is refit and its performance is evaluated. The value of Si with the best performance is determined, and the top Si predictors are used in the final model.

• Random forest (RF)

Random forest is a supervised learning algorithm that can also be used for regression. However, it is mainly used for classification problems. A forest is composed of trees, and more trees mean a more robust forest. Similarly, the random forest algorithm creates decision trees on data samples, obtains a prediction from each, and finally selects the best solution by voting [25]. It is a better ensemble technique than a single decision tree because it reduces over-fitting by averaging the results.
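The four first-level selectors used in this study (univariate chi2, extra trees, RFE and random forest) can be sketched as follows. The paper states only that Python 3.6 was used; the use of scikit-learn and its bundled copy of the WDBC data is an assumption for illustration, and k=5 is used here instead of the paper's 15 just to keep the printout short:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

data = load_breast_cancer()
X, y, names = data.data, data.target, data.feature_names
k = 5

# 1. Univariate chi2 test: scores each attribute independently of the others.
chi2_idx = SelectKBest(chi2, k=k).fit(X, y).get_support(indices=True)

# 2. Extra trees: impurity-based feature importances.
et = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
et_idx = et.feature_importances_.argsort()[::-1][:k]

# 3. Recursive feature elimination around a base estimator.
rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=k).fit(X, y)
rfe_idx = rfe.get_support(indices=True)

# 4. Random forest importances.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
rf_idx = rf.feature_importances_.argsort()[::-1][:k]

for label, idx in [("chi2", chi2_idx), ("extra trees", et_idx),
                   ("RFE", rfe_idx), ("random forest", rf_idx)]:
    print(label, [names[i] for i in idx])
```

Each selector returns its own top-k list; the study then pools these lists and ranks attributes by how many selectors agree on them.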
Column 4: Starting from the first frequency of col. 1, the frequencies are grouped in threes and the highest total is marked.
Column 5: Starting from the second frequency of col. 1 (leaving the first), the frequencies are grouped in threes and the highest total is marked.
Column 6: Starting from the third frequency of col. 1 (leaving the first two), the frequencies are grouped in threes and the highest total is marked.

After completing the grouping table, an analysis table is formed to find out which attributes appear the highest number of times. We tick (√) the values that take part in the maximum of each column.

Accuracy of Classifiers

We have obtained a simplified data set from the above statistical method. To verify that the reduced data set retains sufficient information about the patient category (benign/malignant), we can apply different accuracy measures, such as basic learners, the accuracy of standardized and tuned data sets, the ROC curve, ensemble methods and stacking [27].

Experiment

This section covers the experimental setup and methodology, as well as the results of this model on the diagnostic breast cancer dataset. The division into training and test sets follows the ratio 80:20 and is chosen arbitrarily. The training sets from the two datasets (reduced by feature selection, and containing all features) are processed and evaluated with different accuracy measures for a comparative study. All the analysis was performed using Python 3.6.

Feature Extraction

Two levels of feature extraction methods are applied to the diagnostic breast cancer data set. First, we use the univariate (χ2) test, extra trees, recursive feature elimination, and random forest to select the best features from the data set [28]. By selecting 15 features from each method, we obtain a total of 60 features. All these features are shown in Table 3.

Mapping Features

These features need to be mapped to abbreviations to determine the rank of each feature for analysis. After mapping the features from Table 3 into Table 4, the first column shows the attribute name corresponding to the abbreviated form in the second column. Columns 3–6 then show the attributes that are repeated across the different feature selection techniques. Finally, column 7 gives the rank obtained by each feature. After assigning feature ranks, these 60 features are reduced to 18 features.
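The 80:20 division and the standardized-accuracy measurement described above can be sketched as follows. This is a minimal illustration assuming scikit-learn (not named in the paper) and its bundled WDBC data, with logistic regression standing in for the full set of basic learners:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# 80:20 train/test division, chosen arbitrarily (fixed seed for repeatability).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Standardize inside a pipeline so the scaler is fit on the training data only,
# avoiding leakage from the test fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
print(f"test accuracy: {acc:.4f}")
```

The exact accuracy depends on the random split and hyper-parameters; the point of the sketch is the split ratio and the scaler-inside-pipeline pattern.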
Table 4 Abbreviation and rank distribution of attributes

Attribute name Abbreviated as Univariate Extra Tree RFE RF Rank
area_worst f1 f1 f1 f1 f1 4
area_mean f2 f2 f2 f2 f2 4
area_se f3 f3 f3 f3 f3 4
perimeter_worst f4 f4 f4 f4 f4 4
perimeter_mean f5 f5 f5 f5 f5 4
radius_worst f6 f6 – f6 f6 3
radius_mean f7 f7 f7 f7 f7 4
perimeter_se f8 f8 – – f8 2
texture_worst f9 f9 f9 f9 f9 4
texture_mean f10 f10 – – f10 2
concavity_worst f11 f11 f11 f11 f11 4
radius_se f12 f12 f12 f12 – 3
concavity_mean f13 f13 f13 f13 f13 4
compactness_worst f14 f14 f14 f14 f14 4
concave points_worst f15 f15 f15 f15 f15 4
concave points_mean f16 – f16 f16 f16 3
compactness_mean f17 – f17 f17 – 2
smoothness_mean f18 – f22 – – 1
Grouping table: for each attribute, the first entry is its frequency (the rank from Table 4); the remaining entries are the marked pair and triple totals obtained by the grouping procedure described earlier.

f1 4 8 12
f2 4 8 12
f3 4 8 12
f4 4 8 11
f5 4 7 11
f6 3 7 9
f7 4 6 10
f8 2 6 8
f9 4 6 10
f10 2 6 9
f11 4 7 11
f12 3 7 11
f13 4 8 12
f14 4 8 11
f15 4 7 9
f16 3 5 6
f17 2 3
f18 1
Analysis table: each row I–VI ticks (√) the attributes (f1–f18, left to right) that take part in the marked maximum of the corresponding grouping column; the last row counts the occurrences per attribute.

I √ √ √ √ √ √ √ √ √ √ √
II √ √ √ √ √ √
III √ √ √ √ √ √
IV √ √ √ √ √ √
V √ √ √
VI √ √ √
No. of occurrences 3 5 6 5 3 – 1 – 1 – 1 – 3 4 3 – – –
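The rank-and-occurrence bookkeeping behind these tables amounts to counting how many selectors picked each attribute and keeping the attributes most of them agree on. A minimal sketch (the feature sets below are hypothetical stand-ins, not the paper's Table 3 contents):

```python
from collections import Counter

# Illustrative output of four selectors; each entry is the set of attribute
# names one technique retained (hypothetical values for demonstration).
selections = {
    "univariate":    {"area_worst", "area_mean", "radius_mean", "texture_worst"},
    "extra_trees":   {"area_worst", "area_mean", "concavity_mean"},
    "rfe":           {"area_worst", "radius_mean", "concavity_mean"},
    "random_forest": {"area_worst", "area_mean", "radius_mean"},
}

# Rank of a feature = number of techniques that selected it (column 7 of Table 4).
rank = Counter()
for chosen in selections.values():
    rank.update(chosen)

# Keep the attributes that the majority of selectors agree on.
kept = sorted(f for f, r in rank.items() if r >= 3)
print(rank["area_worst"])  # 4: selected by every technique
print(kept)
```

The paper's mode-based grouping then refines this count-based ranking further, but the core reduction from 60 pooled selections down to a small agreed subset follows this counting pattern.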
Table 7 Abbreviated table of attributes

f1 Area_worst
f2 Area_mean
f3 Area_se
f4 Perimeter_worst
f5 Perimeter_mean
f7 Radius_mean
f9 Texture_worst
f11 Concavity_worst
f13 Concavity_mean
f14 Compactness_worst
f15 Concave points_worst

The ROC curve plots the "true positive rate" (showing the level of correct classification in the positive category) on the y axis. The area under the ROC polyline (AuROC) indicates how much more often the classifier assigns a higher prediction probability to a true positive than to a true negative. Since the performance of every classifier is acceptable, it is difficult to distinguish the individual ROC curves in the graph. Each of the two figures therefore shows the comparative accuracy to enhance the visualization. Of the two ROC curves in Table 8, LR has the highest accuracy, i.e., 99%.

Ensemble Techniques
Table 8 Different accuracy metrics

Metrics | Dataset with 31 attributes | Data subset with 12 attributes
Accuracy of basic classifiers | CART: 0.912077 (run time: 0.268554) | CART: 0.925362 (run time: 0.501953)
 | SVM: 0.619614 (run time: 0.520508) | SVM: 0.619614 (run time: 0.417969)
 | NB: 0.940773 (run time: 0.040039) | NB: 0.938647 (run time: 0.031250)
 | KNN: 0.927729 (run time: 0.151367) | KNN: 0.925507 (run time: 0.106445)
 | LR: 0.949614 (run time: 0.178711) | LR: 0.951836 (run time: 1.174805)
 | MLP: 0.788068 (run time: 1.350586) | MLP: 0.740386 (run time: 0.810547)
Standardized data accuracy | ScaledCART: 0.914396 (run time: 0.854492) | ScaledCART: 0.923092 (run time: 0.049805)
 | ScaledSVM: 0.964879 (run time: 0.078125) | ScaledSVM: 0.958261 (run time: 0.049805)
 | ScaledNB: 0.931932 (run time: 0.031250) | ScaledNB: 0.942995 (run time: 0.031250)
 | ScaledKNN: 0.958357 (run time: 0.076172) | ScaledKNN: 0.949565 (run time: 0.031250)
 | ScaledLR: 0.969324 (run time: 0.118164) | ScaledLR: 0.964928 (run time: 0.050781)
 | ScaledMLP: 0.967101 (run time: 7.107422) | ScaledMLP: 0.958309 (run time: 8.562500)
Tuned accuracy | (LR) 0.977333 (run time: 0.013672) | (LR) 0.974912 (run time: 0.832031)
ROC curve
Accuracy of ensemble models (%)

AB 94.7343
GBC 93.8599
RF 94.7295
ET 95.1739
Bagging 94.5169
XGBoost 95.1691
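The ensemble comparison above and the voting combination of heterogeneous base learners can be sketched as follows. This assumes scikit-learn (not named in the paper) and its bundled WDBC data; XGBoost is omitted because it lives in the separate xgboost package, and the exact scores will differ from the paper's:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Ensemble models analogous to AB, GBC, RF, ET and Bagging above.
ensembles = {
    "AB": AdaBoostClassifier(random_state=0),
    "GBC": GradientBoostingClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "ET": ExtraTreesClassifier(random_state=0),
    "Bagging": BaggingClassifier(random_state=0),
}
scores = {}
for name, clf in ensembles.items():
    scores[name] = clf.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name}: {scores[name]:.4f}")

# Hard-voting combination of heterogeneous base learners (the meta level).
voter = VotingClassifier([
    ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))),
    ("dt", DecisionTreeClassifier(random_state=0)),
    ("rf", RandomForestClassifier(random_state=0)),
])
vote_acc = voter.fit(X_train, y_train).score(X_test, y_test)
print(f"voting: {vote_acc:.4f}")
```

Each ensemble averages or boosts many trees trained on perturbed samples of the training data, while the voting classifier resolves disagreements among dissimilar base learners by majority vote.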
to use a stacked model with AdaBoost, random forest, extra trees, logistic regression, and decision trees on the 0th layer, and a voting classifier on the meta layer. We then predict the target variable on the test dataset and check the accuracy of this stacked model based on those predictions. We obtain 92.9824% accuracy from this model.

Conclusion

Malignant growth is the disease with the second highest diagnostic frequency. The classification results show great significance, especially for the identification of malignant cases. This study proposes a feature selection method (mode) and uses basic classifiers and an ensemble model with stacking classifiers to classify the instances with all attributes in comparison to the reduced data subset. Each instance is classified as benign or malignant, and an overall accuracy of 99% is achieved by the basic classifier. On the WBCD data set, accuracy is 95.1739% with the ensemble model and 92.9824% with the stacked classifier. By comparing the full data set and the data subset, the basic classifier is validated against the stacked and ensemble models in terms of accuracy, accuracy on standardized data, tuned accuracy and AuROC.

Unnecessary attributes need not appear in the data set: they may affect its accuracy, produce over-fitting and consume prediction time. Following these ideas, the legitimacy and clinical value of the ensemble model and stacked model proposed in this study were confirmed.

Discussion

The main idea used in this study is a statistical technique for feature selection that eliminates redundant attributes from the data set. The approach can be applied to comparable problems, such as diabetes typing, cervical cancer survival rates and the identification of tumor cells, and to quite different areas, such as sentiment analysis, drug classification, facial recognition, pedestrian identification in autonomous driving, credit scoring, or spam detection, wherever some attributes of the data set are irrelevant or only weakly relevant. In addition, combining the important classifiers with stacking, ensembling and mode allows modularity of the entire model. After basic preprocessing, any data set with reduced features and a binary classification task can directly use this procedure. Nevertheless, the model still has some shortcomings. Clinical data are rarely dedicated to classification, containing many missing values and anomalies that may affect classification performance. When managing high-dimensional data sets, precision and specificity, the confusion matrix and other indicators should be
considered. These problems make the proposed model not directly applicable in the clinic. Similarly, the choice of feature selection method and the decision on the type and number of pattern classifiers may additionally affect the performance, as well as the time efficiency. Future work may include a system to check whether the standard classifier is indeed ideal and, if necessary, to try to build a better one. With higher dimensions and more examples, deep learning strategies may also help to achieve better classification performance.

Compliance with Ethical Standards

Conflict of Interest The authors declare no conflict of interest.

References

1. https://www.nationalbreastcancer.org/about-breast-cancer/, 2019.
2. Luca M, Kleinberg J, Mullainathan S. Algorithms need managers, too. Brighton: Chapman & Hall Ltd; 2016.
3. Coiera E. Guide to medical informatics, the Internet and telemedicine. London: Chapman & Hall Ltd; 1997.
4. Elsayad AM. Predicting the severity of breast masses with ensemble of Bayesian classifiers. J Comput Sci. 2010;6(5):576–84.
5. Huang M, Hung Y, Chen W. Neural network classifier with entropy based feature selection on breast cancer diagnosis. J Med Syst. 2010;34:865–73. https://doi.org/10.1007/s10916-009-9301-x.
6. Lavanya D, Rani DK. Analysis of feature selection with classification: breast cancer datasets. Indian J Comput Sci Eng (IJCSE). 2011;2(5):756–63.
7. Bekaddour F. A neuro-fuzzy inference model for breast cancer recognition. Int J Comput Sci Inf Technol. 2012;4(5):163–73.
8. Al-Bahrani R, Agrawal A, Choudhary A. Colon cancer survival prediction using ensemble mining on SEER data. In: Proceedings of the IEEE International Conference on Big Data, 2013, pp 9–16.
9. Zheng B, Yoon SW, Lam SS. Breast cancer diagnosis based on feature extraction using a hybrid of K-means and support vector machine algorithms. Expert Syst Appl. 2014;41(4):1476–82.
10. Chaurasia V, Pal S. Data mining techniques: to predict and resolve breast cancer survivability. IJCSMC. 2014;3:10–22.
11. Zhang L, Li J, Xiao Y, et al. Identifying ultrasound and clinical features of breast cancer molecular subtypes by ensemble decision. Sci Rep. 2015;5:11085. https://doi.org/10.1038/srep11085.
12. Hazra A, Mandal S, Gupta A. Study and analysis of breast cancer cell detection using Naïve Bayes, SVM and ensemble algorithms. Int J Comput Appl. 2016;145(2):0975–8887.
13. Nilashi M, Ibrahim O, Ahmadi H, Shahmoradi L. A knowledge-based system for breast cancer classification using fuzzy logic method. Telemat Inf. 2017;34(4):133–44.
14. Chaurasia V, Pal S, Tiwari BB. Prediction of benign and malignant breast cancer using data mining techniques. J Algorithms Comput Technol. 2018;12(2):119–26.
15. Emami N, Pakzad A. A new knowledge-based system for diagnosis of breast cancer by a combination of affinity propagation clustering and firefly algorithm. J AI Data Min. 2018;7:59–68.
16. Kadam VJ, Jadhav SM, Vijayakumar K. Breast cancer diagnosis using feature ensemble learning based on stacked sparse autoencoders and softmax regression. J Med Syst. 2019;43:263. https://doi.org/10.1007/s10916-019-1397-z.
17. Saritas M, Yasar A. Performance analysis of ANN and Naive Bayes classification algorithm for data classification. IJISAE. 2019;7(2):88–91.
18. Rahman MA, Muniyandi RC. An enhancement in cancer classification accuracy using a two-step feature selection method based on artificial neural networks with 15 neurons. Symmetry. 2020;12:271.
19. Dua D, Graff C. UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science; 2019.
20. Batyrshin I. Constructing time series shape association measures: Minkowski distance and data standardization. In: BRICS CCI 2013, Brasil, Porto de Galhinas. 2013. http://arxiv.org/pdf/1311.1958v3.
21. Kavitha R, Kannan E. An efficient framework for heart disease classification using feature extraction and feature selection technique in data mining. In: IEEE International Conference on Emerging Trends in Engineering Technology and Science (ICETETS), 2016, pp 1–5.
22. Uysal AK, Gunal S, Ergin S. The impact of feature extraction and selection on SMS spam filtering. Electronics and Electrical Engineering. 2013;19(5):67–72.
23. Maier O, Wilms M, von der Gablentz J, Krämer UM, Münte TF, Handels H. Extra Tree forests for sub-acute ischemic stroke lesion segmentation in MR sequences. J Neurosci Methods. 2015;240:89–100.
24. Li L, Cui X, Yu S, Zhang Y, Luo Z, Yang H, Zhou Y, Zheng X. PSSP-RFE: accurate prediction of protein structural class by recursive feature extraction from PSI-BLAST profile, physical-chemical property and functional annotations. PLoS One. 2014;9:e92863.
25. Scanlon P, Kennedy IO, Liu Y. Feature extraction approaches to RF fingerprinting for device identification in femtocells. Bell Labs Tech J. 2010;15(3):141–51.
26. Kwac K, Lee H, Cho M. Non-Gaussian statistics of amide I mode frequency fluctuation of N-methylacetamide in methanol solution: linear and nonlinear vibrational spectra. J Chem Phys. 2004;120:1477–90.
27. Labatut V, Cherifi H. Accuracy measures for the comparison of classifiers. 2012. http://arxiv.org/abs/1207.3790.
28. Guyon I, Gunn S, Nikravesh M, Zadeh L, editors. Feature extraction, foundations and applications. New York: Springer; 2006.
29. Araque O, Corcuera-Platas I, Sanchez-Rada JF, Iglesias CA. Enhancing deep learning sentiment analysis with ensemble techniques in social applications. Expert Syst Appl. 2017;77:236–46.
30. Malmasi S, Dras M. Native language identification with classifier stacking and ensembles. Comput Linguist. 2018;44(3):403–46. https://doi.org/10.1162/coli_a_00323.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.