ML Classification1
1 author:
J E T Akinsola
Michael and Cecilia Ibru University (MCIU)
All content following this page was uploaded by J E T Akinsola on 11 July 2017.
Abstract ---- Supervised Machine Learning (SML) is the search for algorithms that reason from externally supplied instances to produce general hypotheses, which then make predictions about future instances. Supervised classification is one of the tasks most frequently carried out by intelligent systems. This paper describes various supervised machine learning classification techniques, compares various supervised learning algorithms, and determines the most efficient classification algorithm based on the data set, the number of instances and the variables (features). Seven different machine learning algorithms were considered: Decision Table, Random Forest (RF), Naïve Bayes (NB), Support Vector Machine (SVM), Neural Networks (Perceptron), JRip and Decision Tree (J48), using the Waikato Environment for Knowledge Analysis (WEKA) machine learning tool. To implement the algorithms, the Diabetes data set was used for the classification, with 768 instances, eight attributes as independent variables and one as the dependent variable. The results show that SVM was the algorithm with the highest precision and accuracy. Naïve Bayes and Random Forest were the next most accurate after SVM. The research shows that the time taken to build a model and precision (accuracy) are one factor, while the kappa statistic and Mean Absolute Error (MAE) are another. Therefore, ML algorithms require precision, accuracy and minimum error for supervised predictive machine learning.

Keywords: Machine Learning, Classifiers, Data Mining Techniques, Data Analysis, Learning Algorithms, Supervised Machine Learning

I. INTRODUCTION
Machine learning is one of the fastest growing areas of computer science, with far-reaching applications. It refers to the automated detection of meaningful patterns in data. Machine learning tools are concerned with endowing programs with the ability to learn and adapt [19].

Machine Learning has become one of the mainstays of Information Technology and, with that, a rather central, albeit usually hidden, part of our life. With the ever increasing amounts of data becoming available, there is good reason to believe that smart data analysis will become even more pervasive as a necessary ingredient for technological progress.

There are several applications for Machine Learning (ML), the most significant of which is data mining. People are often prone to making mistakes during analyses or, possibly, when trying to establish relationships between multiple features [9].

Data Mining and Machine Learning are Siamese twins from which several insights can be derived through proper learning algorithms. There has been tremendous progress in data mining and machine learning as a result of the evolution of smart and nano technology, which brought about curiosity in finding hidden patterns in data to derive value. The fusion of statistics, machine learning, information theory, and computing has created a solid science, with a firm mathematical base and very powerful tools.

Machine learning algorithms are organized into a taxonomy based on the desired outcome of the algorithm. Supervised learning generates a function that maps inputs to desired outputs.

Unprecedented data generation has made machine learning techniques increasingly sophisticated. This has called for the use of several algorithms for both supervised and unsupervised machine learning. Supervised learning is fairly common in classification problems because the goal is often to get the computer to learn a classification system that we have created [21].

ML is well suited to exploiting the opportunities hidden within Big Data. It delivers on the promise of extracting value from large and disparate data sources with far less dependence on human direction, since it is data driven and runs at machine scale. Machine learning is well suited to the complexity of handling disparate data sources and the vast range of variables and amounts of data involved, and it thrives on growing datasets. The more data is fed into an ML system, the more it can learn and apply the results to produce higher-quality insights. Freed from the confines of human-scale thought and analysis, ML is able to discover and display the patterns hidden in the data [15].

One standard formulation of the supervised learning task is the classification problem: the learner is required to learn (to approximate the behavior of) a function which maps a vector into one of several classes by looking at several input-output examples of the function. Inductive machine learning is the process of learning a set of rules from instances (examples in a training set), or more generally speaking, creating a classifier that can be used to generalize to new instances. The process of applying supervised ML to a real-world problem is described in Figure 1.

Figure 1: The Processes of Supervised Machine Learning

This work focuses on the classification of ML algorithms and on determining the most efficient algorithm with the highest accuracy and precision, as well as on establishing the performance of different algorithms on large and smaller data sets, with a view to classifying them correctly and giving insight into how to build supervised machine learning models.

The remaining part of this work is arranged as follows: Section 2 presents the literature review discussing the classification of different supervised learning algorithms; Section 3 presents the methodology used; Section 4 discusses the results of the work; and Section 5 gives the conclusion and recommendations for further work.

II. LITERATURE REVIEW
A. Classification of Supervised Learning Algorithms
According to [21], the supervised machine learning algorithms which deal more with classification include the following: Linear Classifiers, Logistic Regression, Naïve Bayes Classifier, Perceptron, Support Vector Machine; Quadratic Classifiers, K-Means Clustering, Boosting, Decision Tree, Random Forest (RF); Neural Networks, Bayesian Networks and so on.

1) Linear Classifiers: Linear models for classification separate input vectors into classes using linear (hyperplane) decision boundaries [6]. The goal of classification in linear classifiers is to group items that have similar feature values. [23] stated that a linear classifier achieves this goal by making a classification decision based on the value of a linear combination of the features. A linear classifier is often used in situations where the speed of classification is an issue, since it is rated the fastest classifier [21]. Also, linear classifiers often work very well when the number of dimensions is large, as in document classification, where each element is typically the number of occurrences of a word in a document. The rate of convergence among data set variables, however, depends on the margin. Roughly speaking, the margin quantifies how linearly separable a dataset is, and hence how easy it is to solve a given classification problem [18].

2) Logistic regression: This is a classification function that builds and uses a single multinomial logistic regression model with a single estimator. Logistic regression states where the boundary between the classes lies, and also that the class probabilities depend on distance
from the boundary in a specific way. These probabilities move towards the extremes (0 and 1) more rapidly when the norm of the coefficient vector is larger. It is these statements about probabilities that make logistic regression more than just a classifier. It makes stronger, more detailed predictions, and can be fit in a different way; but those strong predictions could be wrong. Logistic regression is an approach to prediction, like Ordinary Least Squares (OLS) regression. However, with logistic regression, prediction results in a dichotomous outcome [13]. Logistic regression is one of the most commonly used tools for applied statistics and discrete data analysis. Logistic regression is linear interpolation [11].

3) Naive Bayesian (NB) Networks: These are very simple Bayesian networks which are composed of directed acyclic graphs with only one parent (representing the unobserved node) and several children (corresponding to observed nodes), with a strong assumption of independence among child nodes in the context of their parent [7]. Thus, the independence model (Naive Bayes) is based on estimating the prior and conditional probabilities of the classes [14]. Bayes classifiers are usually less accurate than other more sophisticated learning algorithms (such as ANNs). However, [5] performed a large-scale comparison of the naive Bayes classifier with state-of-the-art algorithms for decision tree induction, instance-based learning, and rule induction on standard benchmark datasets, and found it to be sometimes superior to the other learning schemes, even on datasets with substantial feature dependencies. The Bayes classifier has an attribute-independence problem, which was addressed with Averaged One-Dependence Estimators [8].

4) Multi-layer Perceptron: This is a classifier in which the weights of the network are found by solving a quadratic programming problem with linear constraints, rather than by solving a non-convex, unconstrained minimization problem as in standard neural network training [21]. Other well-known algorithms are based on the notion of the perceptron [17]. The perceptron algorithm is used for learning from a batch of training instances by running the algorithm repeatedly through the training set until it finds a prediction vector which is correct on all of the training set. This prediction rule is then used for predicting the labels on the test set [9].

5) Support Vector Machines (SVMs): These are the most recent supervised machine learning technique [24]. Support Vector Machine (SVM) models are closely related to classical multilayer perceptron neural networks. SVMs revolve around the notion of a "margin", either side of a hyperplane that separates two data classes. Maximizing the margin, and thereby creating the largest possible distance between the separating hyperplane and the instances on either side of it, has been proven to reduce an upper bound on the expected generalisation error [9].

6) K-means: According to [2] and [22], K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The K-Means algorithm is employed when labeled data is not available [1].

Boosting is a general method of converting rough rules of thumb into a highly accurate prediction rule. Given a "weak" learning algorithm that can consistently find classifiers ("rules of thumb") at least slightly better than random, say, accuracy ≥ 55%, with sufficient data, a boosting algorithm can provably construct a single classifier with very high accuracy, say, 99% [16].

7) Decision Trees: Decision Trees (DT) are trees that classify instances by sorting them based on feature values. Each node in a decision tree represents a feature in an instance to be classified, and each branch represents a value that the node can assume. Instances are classified starting at the root node and sorted based on their feature values [9]. Decision tree learning, used in data mining and machine learning, uses a decision tree as a predictive model which maps observations about an item to conclusions about the item's target value. More descriptive names for such tree models are classification trees or regression trees [20]. Decision tree classifiers usually employ post-pruning techniques that evaluate the performance of decision trees as they are pruned, using a validation set. Any node can be removed and assigned the most common class of the training instances that are sorted to it [9].

8) Neural Networks: [2] opined that Neural Networks (NN) can actually perform a number of regression and/or classification tasks at once,
although commonly each network performs only one. In the vast majority of cases, therefore, the network will have a single output variable, although in the case of many-state classification problems, this may correspond to a number of output units (the post-processing stage takes care of the mapping from output units to output variables). An Artificial Neural Network (ANN) depends upon three fundamental aspects: the input and activation functions of the unit, the network architecture and the weight of each input connection. Given that the first two aspects are fixed, the behavior of the ANN is defined by the current values of the weights. The weights of the net to be trained are initially set to random values, and then instances of the training set are repeatedly exposed to the net. The values for the input of an instance are placed on the input units and the output of the net is compared with the desired output for this instance. Then, all the weights in the net are adjusted slightly in the direction that would bring the output values of the net closer to the values for the desired output. There are several algorithms with which a network can be trained [12].

9) Bayesian Network: A Bayesian Network (BN) is a graphical model for probability relationships among a set of variables (features). Bayesian networks are the most well-known representative of statistical learning algorithms [9]. The most interesting feature of BNs, compared to decision trees or neural networks, is most certainly the possibility of taking into account prior information about a given problem, in terms of structural relationships among its features [9]. This prior expertise, or domain knowledge, about the structure of a Bayesian network can take the following forms:
1. Declaring that a node is a root node, i.e., it has no parents.
2. Declaring that a node is a leaf node, i.e., it has no children.
3. Declaring that a node is a direct cause or direct effect of another node.
4. Declaring that a node is not directly connected to another node.
5. Declaring that two nodes are independent, given a condition-set.
6. Providing partial node ordering, that is, declaring that a node appears earlier than another node in the ordering.
7. Providing a complete node ordering.

A problem of BN classifiers is that they are not suitable for datasets with many features [4].

B. Features of Machine Learning Algorithms
Supervised machine learning techniques are applicable in numerous domains. A number of Machine Learning (ML) application oriented papers can be found in [18], [25].

Generally, SVMs and neural networks tend to perform much better when dealing with multi-dimensions and continuous features. On the other hand, logic-based systems tend to perform better when dealing with discrete/categorical features. For neural network models and SVMs, a large sample size is required in order to achieve maximum prediction accuracy, whereas NB may need a relatively small dataset.

There is general agreement that k-NN is very sensitive to irrelevant features: this characteristic can be explained by the way the algorithm works. Moreover, the presence of irrelevant features can make neural network training very inefficient, even impractical. Most decision tree algorithms cannot perform well with problems that require diagonal partitioning. The division of the instance space is orthogonal to the axis of one variable and parallel to all other axes. Therefore, the resulting regions after partitioning are all hyperrectangles. ANNs and SVMs perform well when multi-collinearity is present and a nonlinear relationship exists between the input and output features.

Naive Bayes (NB) requires little storage space during both the training and classification stages: the strict minimum is the memory needed to store the prior and conditional probabilities. The basic kNN algorithm uses a great deal of storage space for the training phase, and its execution space is at least as big as its training space. On the contrary, for all non-lazy learners, execution space is usually much smaller than training space, since the resulting classifier is usually a highly condensed summary of the data. Moreover, Naive Bayes and kNN can be easily used as incremental learners whereas rule algorithms cannot. Naive Bayes is naturally robust to missing values since these are simply ignored in computing probabilities and hence have no impact on the final decision. On the contrary, kNN and neural networks require complete records to do their work.
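The storage and missing-value properties of Naive Bayes described above can be illustrated with a minimal sketch. This is not the WEKA implementation used in this work: the class name `TinyNaiveBayes`, the use of `None` to mark a missing value, the Laplace smoothing and the toy data are all assumptions made for illustration. The model stores nothing but counts (from which priors and conditional probabilities follow), is updated one record at a time, and skips missing values in both training and prediction.

```python
import math
from collections import defaultdict


class TinyNaiveBayes:
    """Categorical naive Bayes that stores only counts (illustrative sketch)."""

    def __init__(self):
        self.class_counts = defaultdict(int)    # label -> number of records
        self.value_counts = defaultdict(int)    # (label, feature, value) -> count
        self.feature_totals = defaultdict(int)  # (label, feature) -> non-missing count
        self.feature_values = defaultdict(set)  # feature -> distinct values seen

    def update(self, row, label):
        """Incremental training step; missing values (None) are skipped."""
        self.class_counts[label] += 1
        for i, value in enumerate(row):
            if value is None:
                continue
            self.value_counts[(label, i, value)] += 1
            self.feature_totals[(label, i)] += 1
            self.feature_values[i].add(value)

    def predict(self, row):
        """Return the label with the highest log posterior score."""
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            for i, value in enumerate(row):
                if value is None:            # missing value contributes nothing
                    continue
                k = len(self.feature_values.get(i, ())) or 1
                # Laplace smoothing (an assumption of this sketch)
                num = self.value_counts.get((label, i, value), 0) + 1
                den = self.feature_totals.get((label, i), 0) + k
                score += math.log(num / den)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy records (hypothetical): (glucose level, age group) -> class
training = [
    (("high", "older"), "YES"),
    (("high", None), "YES"),
    (("low", "younger"), "NO"),
    (("low", "older"), "NO"),
    ((None, "younger"), "NO"),
]

model = TinyNaiveBayes()
for row, label in training:
    model.update(row, label)

print(model.predict(("high", None)))
```

Because `update` only increments counters, new records can be added at any time without retraining, which is the sense in which Naive Bayes is easily used as an incremental learner.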
Finally, Decision Trees and NB generally have different operational profiles: when one is very accurate the other is not, and vice versa. On the contrary, decision trees and rule classifiers have a similar operational profile. SVM and ANN also have a similar operational profile. No single learning algorithm can uniformly outperform other algorithms over all datasets.

Different data sets, with different kinds of variables and numbers of instances, determine the type of algorithm that will perform well. According to the no free lunch theorem, there is no single learning algorithm that will outperform other algorithms on all data sets [10]. Table 1 presents the comparative analysis of various learning algorithms.

Table 1: Comparing learning algorithms (**** represents the best and * the worst performance) [9]
Table 2 shows 768 as the total number of instances used for this research work, with 500 tested positive for diabetes and 268 tested negative for diabetes.

The comparative analysis among various supervised machine learning algorithms was carried out using WEKA 3.7.13 (WEKA - Waikato Environment for Knowledge Analysis). The data set was prepared so that one nominal attribute column served as the dependent variable. The values of 1 for the class distribution (class variable) were changed to YES, which means tested POSITIVE for DIABETES, and the values of 0 were changed to NO, which means tested NEGATIVE for DIABETES. This is essential because most of the algorithms require that there be at least one nominal variable column. Seven classification algorithms were used in the course of this research, namely: Decision Table, Random Forest, Naïve Bayes, SVM, Neural Networks (Perceptron), JRip and Decision Tree (J48). The following attributes were considered for the comparative analysis: Time, Correctly Classified, Incorrectly Classified, Test Mode, No of instances, Kappa statistic, MAE, Precision of YES, Precision of NO and Classification.

The research work was carried out by tuning the parameters with two different sets of instances. The first category was 768 instances and 9 attributes as follows (Number of times pregnant, Plasma glucose concentration at 2 hours in an oral glucose tolerance test, Diastolic blood pressure (mm Hg), Triceps skin fold thickness (mm), 2-Hour serum insulin (mu U/ml), Body mass index (weight in kg/(height in m)^2), Diabetes pedigree function, Age (years) and Class variable (0 or 1)), with one dependent variable and eight independent variables. The second category of data set was 384 instances and 6 attributes as follows (Number of times pregnant, Plasma glucose concentration at 2 hours in an oral glucose tolerance test, 2-Hour serum insulin (mu U/ml), Diabetes pedigree function, Age (years) and Class variable (0 or 1)), with one dependent variable and five independent variables.

IV. RESULTS AND DISCUSSION
A. Results

WEKA was used in the classification and comparison of the various machine learning algorithms. Table 3 shows the results with 9 attributes as well as the parameters considered.
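Algorithm | Time | Correctly Classified % | Incorrectly Classified % | Test Mode | Attributes | No of instances | Kappa statistic | MAE | Precision of YES | Precision of NO | Classification
JRip | 0.19 | 74.4792 | 25.5208 | 10-fold cross-validation | 9 | 768 | 0.4171 | 0.3461 | 0.659 | 0.780 | Rules
Decision Tree (J48) | 0.14 | 73.8281 | 26.1719 | 10-fold cross-validation | 9 | 768 | 0.4164 | 0.3158 | 0.632 | 0.790 | Tree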
Time is the time taken to build the model.
MAE (Mean Absolute Error) is a measure of how close forecasts or predictions are to the eventual outcome.
Kappa Statistic is a metric that compares an observed accuracy with an expected accuracy (Random Chance).
YES means tested positive for diabetes. NO means tested negative for diabetes.

Table 4 shows the results with 6 attributes of the classification and comparison of the various machine learning algorithms and the parameters considered.
Table 4: Comparison of various classification algorithms with smaller data set and fewer attributes
Algorithm | Time | Correctly Classified % | Incorrectly Classified % | Test Mode | Attributes | No of instances | Kappa statistic | MAE | Precision of YES | Precision of NO | Classification
Decision Table | 0.09 | 67.9688 | 32.0313 | 10-fold cross-validation | 6 | 384 | 0.3748 | 0.3101 | 0.581 | 0.734 | Rules
Random Forest | 0.42 | 71.875 | 28.125 | 10-fold cross-validation | 6 | 384 | 0.3917 | 0.3438 | 0.639 | 0.761 | Trees
Naïve Bayes | 0.01 | 70.5729 | 29.4271 | 10-fold cross-validation | 6 | 384 | 0.352 | 0.3297 | 0.633 | 0.739 | Bayes
SVM | 0.04 | 72.9167 | 27.0833 | 10-fold cross-validation | 6 | 384 | 0.3837 | 0.2708 | 0.711 | 0.735 | Functions
Neural Networks (Perceptron) | 0.17 | 59 | 41 | 10-fold cross-validation | 6 | 384 | 0.1156 | 0.4035 | 0.444 | 0.672 | Functions
JRip | 0.01 | 64 | 36 | 10-fold cross-validation | 6 | 384 | 0.2278 | 0.4179 | 0.514 | 0.714 | Rules
Decision Tree (J48) | 0.03 | 64 | 36 | 10-fold cross-validation | 6 | 384 | 0.1822 | 0.4165 | 0.519 | 0.685 | Tree
Time is the time taken to build the model.
MAE (Mean Absolute Error) is a measure of how close forecasts or predictions are to the eventual outcome.
Kappa Statistic is a metric that compares an observed accuracy with an expected accuracy (Random Chance).
YES means tested positive for diabetes. NO means tested negative for diabetes.
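To make the quantities in this legend concrete, the following sketch computes accuracy, the kappa statistic, the precision of YES and NO, and an MAE value from a 2x2 confusion matrix. The function name and the counts are hypothetical; the confusion matrices behind Tables 3 and 4 are not given in this section. The MAE here assumes hard 0/1 predictions, in which case it equals the error rate; to our understanding WEKA derives MAE from predicted class probabilities, so its values will generally differ.

```python
def classification_metrics(tp, fp, fn, tn):
    """Metrics from a 2x2 confusion matrix, with YES as the positive class.

    tp: actual YES, predicted YES    fn: actual YES, predicted NO
    fp: actual NO,  predicted YES    tn: actual NO,  predicted NO
    """
    n = tp + fp + fn + tn
    observed = (tp + tn) / n  # observed accuracy
    # Accuracy expected by chance, given the same row and column totals.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return {
        "accuracy": observed,
        "kappa": kappa,
        "precision_yes": tp / (tp + fp),
        "precision_no": tn / (tn + fn),
        "mae_hard_labels": (fp + fn) / n,  # assumption: 0/1 predictions
    }


# Hypothetical counts for a 384-instance evaluation, for illustration only.
metrics = classification_metrics(tp=80, fp=40, fn=60, tn=204)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

A kappa of 0 means the classifier agrees with the true labels no more often than chance would, and a kappa of 1 means perfect agreement, which is why it is reported alongside the raw percentage of correctly classified instances.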
Tables 5 and 6: Ranking of Precision of Positive Diabetes and Negative Diabetes using different algorithms, showing smaller and larger data sets respectively
Tables 7 and 8: Ranking of Correctly Classified and Incorrectly Classified with the time to build the model, showing smaller and larger data sets respectively, using different algorithms
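The contents of Tables 5 to 8 are not reproduced in this section. As a rough illustration of how such a ranking can be derived, the sketch below sorts the Table 4 rows (smaller data set) by percentage correctly classified, breaking ties by build time. The tie-breaking rule and the variable names are assumptions of this sketch, not necessarily the procedure used in this work.

```python
# (algorithm, build time, % correctly classified), copied from Table 4.
table4 = [
    ("Decision Table", 0.09, 67.9688),
    ("Random Forest", 0.42, 71.875),
    ("Naive Bayes", 0.01, 70.5729),
    ("SVM", 0.04, 72.9167),
    ("Neural Networks (Perceptron)", 0.17, 59.0),
    ("JRip", 0.01, 64.0),
    ("Decision Tree (J48)", 0.03, 64.0),
]

# Highest accuracy first; among equal accuracies, the faster model ranks higher.
ranking = sorted(table4, key=lambda row: (-row[2], row[1]))

for rank, (name, build_time, correct) in enumerate(ranking, start=1):
    print(f"{rank}. {name}: {correct}% correct, build time {build_time}")
```

On the Table 4 figures, SVM ranks first, followed by Random Forest and Naïve Bayes, with the Perceptron last.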
... regardless of the number of attributes and data instances. This research shows that the time taken to build a model and precision are one factor, while the kappa statistic and MAE are another. Therefore, ML algorithms require precision, accuracy and minimum error for supervised predictive machine learning.

This work recommends that for large data sets, a distributed processing environment should be considered. This will create room for a high level of correlation among the variables, which will ultimately make the output of the model more efficient.

REFERENCES
[1] Smola, A. & Vishwanathan, S.V.N. (2008). Introduction to Machine Learning. Published by the press syndicate of the University of Cambridge, Cambridge, United Kingdom. Copyright ⓒ Cambridge University Press 2008. ISBN: 0-521-82583-0. Available at KTH website: https://www.kth.se/social/upload/53a14887f276540ebc81aec3/online.pdf Retrieved from website: http://alex.smola.org/drafts/thebook.pdf

[2] Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Clarendon Press, Oxford, England. Oxford University Press, Inc., New York, NY, USA. ISBN: 0198538642. Available at: http://cs.du.edu/~mitchell/mario_books/Neural_Networks_for_Pattern_Recognition_-_Christopher_Bishop.pdf

[3] Brazdil, P., Soares, C. & da Costa, J. (2003). Ranking Learning Algorithms: Using IBL and Meta-Learning on Accuracy and Time Results. Machine Learning, Volume 50, Issue 3, pp. 251-277. Copyright © Kluwer Academic Publishers. Manufactured in The Netherlands. doi: 10.1023/A:1021713901879. Available at Springer website: https://link.springer.com/content/pdf/10.1023%2FA%3A1021713901879.pdf

[4] Cheng, J., Greiner, R., Kelly, J., Bell, D. & Liu, W. (2002). Learning Bayesian networks from data: An information-theory based approach. Artificial Intelligence, Volume 137, pp. 43-90. Copyright © 2002. Published by Elsevier Science B.V. Available at ScienceDirect: http://www.sciencedirect.com/science/article/pii/S0004370202001911

[5] Domingos, P. & Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, Volume 29, pp. 103-130. Copyright © 1997 Kluwer Academic Publishers. Manufactured in The Netherlands. Available at University of Trento website: http://disi.unitn.it/~p2p/RelatedWork/Matching/domingos97optimality.pdf

[6] Elder, J. (n.d.). Introduction to Machine Learning and Pattern Recognition. Available at LASSONDE University EECS Department York website: http://www.eecs.yorku.ca/course_archive/2011-12/F/4404-5327/lectures/01%20Introduction.pd

[7] Good, I.J. (1951). Probability and the Weighing of Evidence. Philosophy, Volume 26, Issue 97, 1951. Published by Charles Griffin and Company, London 1950. Copyright © The Royal Institute of Philosophy 1951, pp. 163-164. doi: https://doi.org/10.1017/S0031819100026863. Available at Royal Institute of Philosophy website: https://www.cambridge.org/core/journals/philosophy/article/probability-and-the-weighing-of-evidence-by-goodi-j-london-charles-griffin-and-company-1950-pp-viii-119-price-16s/7D911224F3713FDCFD1451BBB2982442

[8] Hormozi, H., Hormozi, E. & Nohooji, H. R. (2012). The Classification of the Applicable Machine Learning Methods in Robot Manipulators. International Journal of Machine Learning and Computing (IJMLC), Vol. 2, No. 5, pp. 560-563. doi: 10.7763/IJMLC.2012.V2.189. Available at IJMLC website: http://www.ijmlc.org/papers/189-C00244-001.pdf

[9] Kotsiantis, S. B. (2007). Supervised Machine Learning: A Review of Classification Techniques. Informatica 31 (2007), pp. 249-268. Retrieved from IJS website: http://wen.ijs.si/ojs-2.4.3/index.php/informatica/article/download/148/140.

[10] Lemnaru, C. (2012). Strategies for dealing with Real World Classification Problems (Unpublished PhD thesis). Faculty of Computer Science and Automation, Universitatea Technica, Din Cluj-Napoca. Available at website: http://users.utcluj.ro/~cameliav/documents/TezaFinalLemnaru.pdf

[11] Logistic Regression, pp. 223-237. Available at: https://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch12.pdf

[12] Neocleous, C. & Schizas, C. (2002). Artificial Neural Network Learning: A Comparative Review. In: Vlahavas I.P., Spyropoulos C.D. (eds) Methods and Applications of Artificial Intelligence. Hellenic Conference on Artificial Intelligence SETN 2002. Lecture Notes in Computer Science, Volume 2308. Springer, Berlin, Heidelberg, pp. 300-313. doi: 10.1007/3-540-46014-4_27. Available at: https://link.springer.com/chapter/10.1007/3-540-46014-4_27

[13] Newsom, I. (2015). Data Analysis II: Logistic Regression. Available at: http://web.pdx.edu/~newsomj/da2/ho_logistic.pdf

[14] Nilsson, N.J. (1965). Learning machines. New York: McGraw-Hill. Published in: Journal of IEEE Transactions on Information Theory, Volume 12, Issue 3, 1966, pp. 407-407. doi: 10.1109/TIT.1966.1053912. Available at ACM digital library website: http://dl.acm.org/citation.cfm?id=2267404

[15] Pradeep, K. R. & Naveen, N. C. (2017). A Collective Study of Machine Learning (ML) Algorithms with Big Data Analytics (BDA) for Healthcare Analytics (HcA). International Journal of Computer Trends and Technology (IJCTT), Volume 47, Number 3, pp. 149-155. ISSN: 2231-2803. doi: 10.14445/22312803/IJCTT-V47P121. Available from IJCTT website: http://www.ijcttjournal.org/2017/Volume47/number-3/IJCTT-V47P121.pdf

[16] Schapire, R. (n.d.). Machine Learning Algorithms for Classification.

[17] Rosenblatt, F. (1962). Principles of Neurodynamics. Spartan, New York.

[18] Setiono, R. and Loew, W. K. (2000). FERNN: An algorithm for fast extraction of rules from neural networks. Applied Intelligence.