A PROJECT REPORT
On
“ERROR PREDICTION USING W.E.K.A”
Submitted to
KIIT Deemed to be University
In Partial Fulfilment of the Requirement for the Award of
BACHELOR’S DEGREE
IN COMPUTER SCIENCE AND ENGINEERING
BY
UNDER THE GUIDANCE OF
Dr. Ajay Kumar Jena
Acknowledgement
We would like to thank our mentor Dr. A. K. Jena for providing constant support and
giving us a broader picture of the whole scenario, without which the completion of the
project would not have been possible. His constant endeavor and guidance have helped us
complete the project within the stipulated time. Last but not least, we would like to
thank our fellow project mates for helping us out with all the problems that we faced and
for making our experience an enjoyable one.
CERTIFICATE
This is to certify that the project entitled
“ERROR PREDICTION USING W.E.K.A.”
Submitted by
is a record of bonafide work carried out by them, in partial fulfilment of the requirements for
the award of the Degree of Bachelor of Engineering (Computer Science & Engineering) at KIIT
Deemed to be University, Bhubaneswar. This work was done during the year 2018-2019, under our
guidance.
Date: 02/03/2019
Declaration
We declare that this written submission represents our ideas in our own words and, where
others' ideas or words have been included, we have adequately cited and referenced the
original sources. We also declare that we have adhered to all principles of academic honesty
and integrity and have not misrepresented or fabricated or falsified any idea/data/fact/source
in our submission. We understand that any violation of the above will be cause for disciplinary
action by the Institute and can also evoke penal action from the sources which have thus not
been properly cited or from whom proper permission has not been taken when needed.
Date: 02 April, 2019
ABSTRACT
In this project we have attempted to analyze different medical datasets and the Pre-Poll Survey
of the 2012 U.S. Presidential Election by applying different classification and clustering
algorithms, in order to better understand the nature of the data and the relationships within it
and to determine the optimal model for providing the most accurate predictions. We have used the
WEKA workbench to perform all the operations on the datasets. We have mainly used the graphical
user interface called the Explorer for classifying, clustering and visualizing the datasets.
CHAPTER - 1
INTRODUCTION
1.1 OVERVIEW
Traditional analytics tools are not suited for capturing the full value of big data.
The volume of big data is too large for comprehensive analysis, and the range of potential
correlations and relationships between disparate data sources, from back-end customer
databases to live web-based clickstreams, is too great for any analyst to test all hypotheses
and derive all the value buried in the data.
Basic analytical methods used in business intelligence and enterprise reporting tools reduce to
reporting sums, counts and simple averages and running SQL queries. Online analytical processing
is merely a systematized extension of these basic analytics that still relies on a human to direct
activities and specify what should be calculated.
Machine learning is ideal for exploiting the opportunities hidden in big data. It delivers on the
promise of extracting value from big and disparate data sources with far less reliance on human
direction. It is data driven and runs at machine scale. Machine Learning is well suited to the
complexity of dealing with disparate data sources and the huge variety of variables and amounts
of data involved. And unlike traditional analysis, machine learning thrives on growing datasets.
The more data that is fed into a machine learning system, the more it can learn, and the higher
the quality of the insights it can produce.
Freed from the limitations of human scale thinking and analysis, machine learning is able to
discover and display the patterns buried in the data.
The team at Waikato has incorporated several standard Machine Learning techniques into a software
workbench called WEKA (Waikato Environment for Knowledge Analysis). With the use of WEKA, a
specialist in a particular field is able to apply ML and derive useful knowledge from databases
that are far too large to be analyzed by hand.
1.2 OBJECTIVE
The main purpose of this descriptive analysis is to better understand large and complex datasets,
find errors in them, and identify the most efficient classification and clustering algorithms for
future use in machine learning, thereby building knowledge of machine learning that can later be
applied to Artificial Neural Networks (ANN).
CHAPTER – 2
REVIEW OF LITERATURE
Similar research and projects have been done on Machine Learning using the WEKA tool, some of
which are summarized below.
[1] Ramamohan Y. et al. (2012): Data mining tools are used for making predictions and fruitful
decisions in business. This paper gives a brief overview of different data mining tools such as
Weka, Tanagra, Rapid Miner, DBMiner, Witness Miner, and Orange.
[2] Bhaise R. B. et al. (2013): Educational Data Mining, aimed at improving students' education,
is the main theme of this paper. The authors applied the K-Means clustering technique to sample
data. This technique is used to analyze the data from different dimensions and to categorize it;
clusters were formed according to the students' performance in the examination. The information
generated after applying the mining technique is very useful for teachers as well as students.
[3] Borkar S. and Rajeswari K. (2013): Association rule mining is useful for evaluating student
performance. In this paper, the Weka tool is used for data analysis. The main goal of the paper is
to predict student performance in the university exam on the basis of criteria such as internal
exams, assignments and attendance. The paper concludes that the university results of weaker
students can be improved by putting extra effort into their unit tests, attendance and assignments.
[4] Chaurasia V. and Pal S. (2013): This paper surveys different data mining techniques that
support decision making by medical practitioners; using them, doctors can predict the presence of
heart disease. The paper applies Naive Bayes, J48 Decision Tree and Bagging techniques to heart
disease diagnosis. As a result, the Bagging algorithm performs better than the others because it
gives human-readable classification rules.
CHAPTER – 3
MACHINE LEARNING
Over the past two decades Machine Learning has become one of the main-stays of information
technology and with that, a rather central, albeit usually hidden, part of our life. With the ever
increasing amounts of data becoming available there is good reason to believe that smart data
analysis will become even more pervasive as a necessary ingredient for technological progress.
The purpose of this chapter is to provide the reader with an overview of the vast range of
applications which have at their heart a machine learning problem and to bring some degree of
order to the zoo of problems. After that, we will discuss some basic tools from statistics and
probability theory, since they form the language in which many machine learning problems must
be phrased to become amenable to solving.
Machine learning can appear in many guises. We now discuss a number of applications, the
types of data they deal with, and finally, we formalize the problems in a somewhat more stylized
fashion. The latter is key if we want to avoid reinventing the wheel for every new application.
Instead, much of the art of machine learning is to reduce a range of fairly disparate problems to a
set of fairly narrow prototypes. Much of the science of machine learning is then to solve those
problems and provide good guarantees for the solutions.
3.1.1 Applications
Most readers will be familiar with the concept of web page ranking. That is, the process of
submitting a query to a search engine, which then finds webpages relevant to the query and
which returns them in their order of relevance. A rather related application is collaborative
filtering. Internet bookstores such as Amazon, or video rental sites such as Netflix use this
information extensively to entice users to purchase additional goods (or rent more movies). The
problem is quite similar to the one of web page ranking. As before, we want to obtain a sorted
list (in this case of articles). The key difference is that an explicit query is missing and instead we
can only use past purchase and viewing decisions of the user to predict future viewing and
purchase habits. The key side information here are the decisions made by similar users, hence the
collaborative nature of the process.
Automatic translation of documents is another application. Trying to fully understand a text
before translating it, using a set of hand-crafted linguistic rules, is a rather arduous task, in
particular given that text is not always grammatically correct, nor is the document understanding
part itself a trivial one. Instead, we could simply use examples of
translated documents, such as the proceedings of the Canadian parliament or other multilingual
entities (United Nations, European Union, Switzerland) to learn how to translate between the two
languages. In other words, we could use examples of translations to learn how to translate. This
machine learning approach proved quite successful.
Many security applications, e.g. for access control, use face recognition as one of their
components. That is, given the photo (or video recording) of a person, recognize who this person
is. In other words, the system needs to classify the faces into one of many categories (Alice, Bob,
Charlie...) or decide that it is an unknown face. A similar, yet conceptually quite different
problem is that of verification. Here the goal is to verify whether the person in question is who
he claims to be. Note that differently to before, this is now a yes/no question. To deal with
different lighting conditions, facial expressions, whether a person is wearing glasses, hairstyle,
etc., it is desirable to have a system which learns which features are relevant for identifying a
person.
Other applications which take advantage of learning are speech recognition (annotate an audio
sequence with text, such as the system shipping with Microsoft Vista), the recognition of
handwriting (annotate a sequence of strokes with text, a feature common to many PDAs), track
pads of computers (e.g. Synaptics, a major manufacturer of such pads, derives its name from the
synapses of a neural network), the detection of failure in jet engines, avatar behavior in computer
games (e.g. Black and White), direct marketing (companies use past purchase behavior to
guesstimate whether you might be willing to purchase even more) and cleaning robots (such as
iRobot's Roomba). The overarching theme of learning problems is that there exists a nontrivial
dependence between some observations, which we will commonly refer to as x and a desired
response, which we refer to as y, for which a simple set of deterministic rules is not known. By
using learning we can infer such a dependency between x and y in a systematic fashion.
3.2 Data
It is useful to characterize learning problems according to the type of data they use. This is a
great help when encountering new challenges, since quite often problems on similar data types
can be solved with very similar techniques. For instance natural language processing and
bioinformatics use very similar tools for strings of natural language text and for DNA sequences.
Vectors constitute the most basic entity we might encounter in our work. For instance, a life
insurance company might be interested in obtaining the vector of variables (blood pressure,
heart rate, height, weight, cholesterol level, smoker, gender) to infer the life expectancy of a
potential customer. A farmer might be interested in determining the ripeness of fruit based on
(size, weight, and spectral data). An engineer might want to find dependencies in (voltage,
current) pairs. Likewise one might want to represent documents by a vector of counts which
describe the occurrence of words. The latter is commonly referred to as bag of words features.
One of the challenges in dealing with vectors is that the scales and units of different coordinates
may vary widely. For instance, we could measure weight in kilograms, pounds, grams, tons or
stones, all of which would amount to multiplicative changes of scale. Likewise, when representing
temperatures, the choice between Celsius, Kelvin and Fahrenheit amounts to an affine transformation
of the values.
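To make the vector representation above concrete, the following is a minimal Java sketch (the
sample document and the height values are made-up illustrations, not data from this project) that
builds a bag-of-words count vector and standardizes a numeric feature so that coordinates measured
in different units become comparable.

import java.util.LinkedHashMap;
import java.util.Map;

public class VectorExamples {
    // Bag of words: represent a document by the count of each word it contains.
    static Map<String, Integer> bagOfWords(String document) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String word : document.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Z-score standardization: removes the effect of the unit/scale of a coordinate.
    static double[] standardize(double[] values) {
        double mean = 0, var = 0;
        for (double v : values) mean += v;
        mean /= values.length;
        for (double v : values) var += (v - mean) * (v - mean);
        double std = Math.sqrt(var / values.length);
        double[] z = new double[values.length];
        for (int i = 0; i < values.length; i++) z[i] = (values[i] - mean) / std;
        return z;
    }

    public static void main(String[] args) {
        System.out.println(bagOfWords("the quick brown fox jumps over the lazy dog"));
        // Hypothetical heights in centimetres: the unit of measurement disappears after standardization.
        double[] heightsCm = {160, 170, 180, 190};
        for (double z : standardize(heightsCm)) System.out.printf("%.2f ", z);
    }
}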
Lists: In some cases the vectors we obtain may contain a variable number of features. For
instance, a physician might not necessarily decide to perform a full battery of diagnostic tests if
the patient appears to be healthy.
Sets may appear in learning problems whenever there is a large number of potential causes of an
effect, which are not well determined. For instance, it is relatively easy to obtain data concerning
the toxicity of mushrooms. It would be desirable to use such data to infer the toxicity of a new
mushroom given information about its chemical compounds. However, mushrooms contain a
cocktail of compounds out of which one or more may be toxic. Consequently we need to infer
the properties of an object given a set of features, whose composition and number may vary
considerably.
Images could be thought of as two-dimensional arrays of numbers, that is, matrices. This
representation is very crude, though, since images exhibit spatial coherence (lines, shapes) and
(in the case of natural images) a multiresolution structure. That is, downsampling an image leads
to an object which has very similar statistics to the original image. Computer vision and
psycho-optics have created a raft of tools for describing these phenomena.
Video adds a temporal dimension to images. Again, we could represent them as a three
dimensional array. Good algorithms, however, take the temporal coherence of the image
sequence into account.
Trees and Graphs are often used to describe relations between collections of objects. For
instance the ontology of webpages of the DMOZ project has the form of a tree with topics
becoming increasingly refined as we traverse from the root to one of the leaves. In the case of
gene ontology the relationships form a directed acyclic graph, also referred to as the
GO-DAG.
Some basic machine learning techniques include:
1. Naïve Bayes.
2. Nearest Neighbor Estimators.
3. A Simple Classifier.
4. Perceptron.
5. K-Means.
6. J48.
CHAPTER - 4
CLASSIFICATION
4.1 INTRODUCTION
Classification is the task of learning a target function f that maps each attribute set x to one of
the predefined class labels y. The target function is also known informally as a classification
model.
Predictive Modeling - A classification model can also be used to predict the class label of
unknown records. A classification model can be treated as a black box that automatically assigns
a class label when presented with the attribute set of an unknown record.
Classification techniques are most suited for predicting or describing data sets with binary or
nominal categories. They are less effective for ordinal categories (e.g., to classify a person as a
member of high-, medium-, or low income group) because they do not consider the implicit
order among the categories. Other forms of relationships, such as the subclass–superclass
relationships among categories (e.g., humans and apes are primates, which in turn, is a subclass
of mammals) are also ignored. The remainder of this chapter focuses only on binary or nominal
class labels.
First, a training set consisting of records whose class labels are known must be provided. The
training set is used to build a classification model, which is subsequently applied to the test set,
which consists of records with unknown class labels.
Evaluation of the performance of a classification model is based on the counts of test records
correctly and incorrectly predicted by the model. These counts are tabulated in a table known as
a confusion matrix. Table 3.1 depicts the confusion matrix for a binary classification problem.
Each entry fij in this table denotes the number of records from class i predicted to be of class j.
For instance, f01 is the number of records from class 0 incorrectly predicted as class 1. Based on
the entries in the confusion matrix, the total number of correct predictions made by the model is
(f11 + f00) and the total number of incorrect predictions is (f10 + f01).
Table 3.1: Confusion matrix for a binary classification problem

                     Predicted Class = 1    Predicted Class = 0
Actual Class = 1            f11                    f10
Actual Class = 0            f01                    f00
Although a confusion matrix provides the information needed to determine how well a
classification model performs, summarizing this information with a single number would make it
more convenient to compare the performance of different models. This can be done using a
performance metric such as accuracy, which is defined as follows:

    Accuracy = (f11 + f00) / (f11 + f10 + f01 + f00)

Equivalently, the performance of a model can be expressed in terms of its error rate, which is
given by the following equation:

    Error rate = (f10 + f01) / (f11 + f10 + f01 + f00)
Most classification algorithms seek models that attain the highest accuracy, or equivalently, the
lowest error rate when applied to the test set.
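As a small worked illustration of these two measures, the sketch below computes accuracy and error
rate directly from the four confusion-matrix entries defined above; the counts in main are
hypothetical values used only for demonstration.

public class ConfusionMetrics {
    // f11, f10, f01, f00 follow the notation above: fij = records of class i predicted as class j.
    static double accuracy(int f11, int f10, int f01, int f00) {
        return (double) (f11 + f00) / (f11 + f10 + f01 + f00);
    }

    static double errorRate(int f11, int f10, int f01, int f00) {
        return (double) (f10 + f01) / (f11 + f10 + f01 + f00);
    }

    public static void main(String[] args) {
        // Hypothetical counts, not results from this project.
        int f11 = 40, f10 = 10, f01 = 5, f00 = 45;
        System.out.println("Accuracy   = " + accuracy(f11, f10, f01, f00));  // 0.85
        System.out.println("Error rate = " + errorRate(f11, f10, f01, f00)); // 0.15
    }
}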
To illustrate how classification with a decision tree works, consider a simpler version of the
vertebrate classification problem. Instead of classifying the vertebrates into five distinct groups
of species, we assign them to two categories: mammals and non-mammals. Suppose a new
species is discovered by scientists. How can we tell whether it is a mammal or a non-mammal?
One approach is to pose a series of questions about the characteristics of the species. The first
question we may ask is whether the species is cold- or warm-blooded. If it is cold-blooded, then
it is definitely not a mammal. Otherwise, it is either a bird or a mammal. In the latter case, we
need to ask a follow-up question: Do the females of the species give birth to their young? Those
that do give birth are definitely mammals, while those that do not are likely to be non-mammals
(with the exception of egg-laying mammals such as the platypus and spiny anteater).
The previous example illustrates how we can solve a classification problem by asking a series of
carefully crafted questions about the attributes of the test record. Each time we receive an
answer, a follow-up question is asked until we reach a conclusion about the class label of the
record. The series of questions and their possible answers can be organized in the form of a
decision tree, which is a hierarchical structure consisting of nodes and directed edges. The tree
has three types of nodes:
• A root node, which has no incoming edges and zero or more outgoing edges.
• Internal nodes, each of which has exactly one incoming edge and two or more outgoing edges.
• Leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing edges.
In a decision tree, each leaf node is assigned a class label. The nonterminal nodes, which
include the root and other internal nodes, contain attribute test conditions to separate records that
have different characteristics. Classifying a test record is straightforward once a decision tree has
been constructed. Starting from the root node, we apply the test condition to the record and
follow the appropriate branch based on the outcome of the test. This will lead us either to another
internal node, for which a new test condition is applied, or to a leaf node. The class label
associated with the leaf node is then assigned to the record.
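The two-question vertebrate example above can be written directly as code. The following minimal
sketch hard-codes that small tree and classifies a record by following one branch per answer; the
boolean attribute names are assumptions made for illustration.

public class VertebrateTree {
    // Each record is described by the two boolean attributes used in the example above.
    static String classify(boolean warmBlooded, boolean givesBirth) {
        if (!warmBlooded) {          // root node test: body temperature
            return "Non-mammal";     // cold-blooded, so definitely not a mammal
        }
        // internal node test: do the females of the species give birth to their young?
        return givesBirth ? "Mammal" : "Non-mammal";
    }

    public static void main(String[] args) {
        System.out.println(classify(true, true));   // e.g. a dog   -> Mammal
        System.out.println(classify(true, false));  // e.g. a bird  -> Non-mammal
        System.out.println(classify(false, false)); // e.g. a snake -> Non-mammal
    }
}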
A learning algorithm for inducing decision trees must address the following two issues.
1. How should the training records be split? Each recursive step of the tree-growing
process must select an attribute test condition to divide the records into smaller subsets.
To implement this step, the algorithm must provide a method for specifying the test
condition for different attribute types as well as an objective measure for evaluating the
goodness of each test condition.
2. How should the splitting procedure stop? A stopping condition is needed to terminate
the tree-growing process. A possible strategy is to continue expanding a node until either
all the records belong to the same class or all the records have identical attribute values.
Although both conditions are sufficient to stop any decision tree induction algorithm,
other criteria can be imposed to allow the tree-growing procedure to terminate earlier.
Decision tree induction algorithms must provide a method for expressing an attribute test
condition and its corresponding outcomes for different attribute types.
Binary Attributes The test condition for a binary attribute generates two potential outcomes.
Nominal Attributes Since a nominal attribute can have many values, its test condition can be
expressed in two ways. For a multi way split, the number of outcomes depends on the number of
distinct values for the corresponding attribute. For example, if an attribute such as marital status
has three distinct values—single, married, or divorced—its test condition will produce a three-
way split. On the other hand, some decision tree algorithms, such as CART, produce only binary
splits by considering all 2^(k−1) − 1 ways of creating a binary partition of k attribute values.
Ordinal Attributes Ordinal attributes can also produce binary or multi way splits. Ordinal
attribute values can be grouped as long as the grouping does not violate the order property of the
attribute values.
Continuous Attributes For continuous attributes, the test condition can be expressed as a
comparison test (A < v) or (A ≥ v) with binary outcomes, or a range query with outcomes of the
form v_i ≤ A < v_(i+1), for i = 1, . . . , k. For the binary case, the decision tree algorithm must
consider all possible split positions v, and it selects the one that produces the best partition. For
the multi way split, the algorithm must consider all possible ranges of continuous values. After
discretization, a new ordinal value will be assigned to each discretized interval. Adjacent
intervals can also be aggregated into wider ranges as long as the order property is preserved.
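As a concrete illustration of the 2^(k−1) − 1 count mentioned above, the following sketch
enumerates every binary partition of a small set of nominal values such as marital status; for
k = 3 it prints the 3 possible two-way splits. It is an illustrative enumeration, not part of any
particular decision tree package.

import java.util.ArrayList;
import java.util.List;

public class BinaryPartitions {
    // Enumerate all 2^(k-1) - 1 ways of splitting k nominal values into two non-empty groups.
    static void enumerate(String[] values) {
        int k = values.length;
        // Each partition is encoded by a bitmask over the first k-1 values. Because the last
        // value's bit is never set, it always stays in the "left" group, which avoids counting
        // mirrored partitions twice.
        for (int mask = 1; mask < (1 << (k - 1)); mask++) {
            List<String> left = new ArrayList<>(), right = new ArrayList<>();
            for (int i = 0; i < k; i++) {
                if (((mask >> i) & 1) == 0) left.add(values[i]); else right.add(values[i]);
            }
            System.out.println(left + " vs " + right);
        }
    }

    public static void main(String[] args) {
        enumerate(new String[] {"single", "married", "divorced"}); // prints 3 partitions
    }
}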
CHAPTER - 5
CLUSTERING
Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both. If
meaningful groups are the goal, then the clusters should capture the natural structure of the data.
In some cases, however, cluster analysis is only a useful starting point for other purposes, such as
data summarization. Whether for understanding or utility, cluster analysis has long played an
important role in a wide variety of fields: psychology and other social sciences, biology,
statistics, pattern recognition, information retrieval, machine learning, and data mining.
There have been many applications of cluster analysis to practical problems. We provide some
specific examples, organized by whether the purpose of the clustering is understanding or utility.
Clustering for Understanding Classes, or conceptually meaningful groups of objects that share
common characteristics, play an important role in how people analyze and describe the world.
Indeed, human beings are skilled at dividing objects into groups (clustering) and assigning
particular objects to these groups (classification). For example, even relatively young children
can quickly label the objects in a photograph as buildings, vehicles, people, animals, plants, etc.
In the context of understanding data, clusters are potential classes and cluster analysis is the
study of techniques for automatically finding classes. The following are some examples:
• Biology. Biologists have spent many years creating a taxonomy (hierarchical classification) of
all living things: kingdom, phylum, class, order, family, genus, and species. Thus, it is perhaps
not surprising that much of the early work in cluster analysis sought to create a discipline of
mathematical taxonomy that could automatically find such classification structures. More
recently, biologists have applied clustering to analyze the large amounts of genetic information
that are now available. For example, clustering has been used to find groups of genes that have
similar functions.
• Information Retrieval. The World Wide Web consists of billions of Web pages, and the
results of a query to a search engine can return thousands of pages. Clustering can be used to
group these search results into a small number of clusters, each of which captures a particular
aspect of the query. For instance, a query of “movie” might return Web pages grouped into
categories such as reviews, trailers, stars, and theaters. Each category (cluster) can be broken into
subcategories (sub clusters), producing a hierarchical structure that further assists a user’s
exploration of the query results.
• Climate. Understanding the Earth’s climate requires finding patterns in the atmosphere and
ocean. To that end, cluster analysis has been applied to find patterns in the atmospheric pressure
of Polar Regions and areas of the ocean that have a significant impact on land climate.
• Psychology and Medicine. An illness or condition frequently has a number of variations, and
cluster analysis can be used to identify these different subcategories. For example, clustering has
been used to identify different types of depression. Cluster analysis can also be used to detect
patterns in the spatial or temporal distribution of a disease.
• Business. Businesses collect large amounts of information on current and potential customers.
Clustering can be used to segment customers into a small number of groups for additional
analysis and marketing activities.
Clustering for Utility Cluster analysis provides an abstraction from individual data objects to
the clusters in which those data objects reside. Additionally, some clustering techniques
characterize each cluster in terms of a cluster prototype; i.e., a data object that is representative
of the other objects in the cluster. These cluster prototypes can be used as the basis for a number
of data analysis or data processing techniques. Therefore, in the context of utility, cluster
analysis is the study of techniques for finding the most representative cluster prototypes.
• Summarization. Many data analysis techniques, such as regression or PCA, have a time or
space complexity of O(m^2) or higher (where m is the number of objects), and thus, are not
practical for large data sets. However, instead of applying the algorithm to the entire data set, it
can be applied to a reduced data set consisting only of cluster prototypes. Depending on the type
of analysis, the number of prototypes, and the accuracy with which the prototypes represent the
data, the results can be comparable to those that would have been obtained if all the data could
have been used.
• Compression. Cluster prototypes can also be used for data compression. In particular, a table is
created that consists of the prototypes for each cluster; i.e., each prototype is assigned an integer
value that is its position (index) in the table. Each object is represented by the index of the
prototype associated with its cluster. This type of compression is known as vector quantization
and is often applied to image, sound, and video data, where (1) many of the data objects are
highly similar to one another, (2) some loss of information is acceptable, and (3) a substantial
reduction in the data size is desired.
• Efficiently Finding Nearest Neighbors. Finding nearest neighbors can require computing the
pairwise distance between all points. Often clusters and their cluster prototypes can be found
much more efficiently. If objects are relatively close to the prototype of their cluster, then we can
use the prototypes to reduce the number of distance computations that are necessary to find the
nearest neighbors of an object. Intuitively, if two cluster prototypes are far apart, then the objects
in the corresponding clusters cannot be nearest neighbors of each other. Consequently, to find an
object’s nearest neighbors it is only necessary to compute the distance to objects in nearby
clusters, where the nearness of two clusters is measured by the distance between their prototypes.
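A minimal sketch of the prototype-based ideas above: the encode method performs vector
quantization by replacing each object with the index of its nearest prototype, and the same
nearest-prototype computation is what allows a nearest-neighbor search to skip clusters whose
prototypes are far away. The prototypes and points below are made-up two-dimensional values.

public class VectorQuantization {
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return sum;
    }

    // Return the index of the prototype closest to the point.
    static int nearestPrototype(double[] point, double[][] prototypes) {
        int best = 0;
        for (int j = 1; j < prototypes.length; j++) {
            if (squaredDistance(point, prototypes[j]) < squaredDistance(point, prototypes[best])) {
                best = j;
            }
        }
        return best;
    }

    // Vector quantization: compress a data set to one prototype index per object.
    static int[] encode(double[][] points, double[][] prototypes) {
        int[] codes = new int[points.length];
        for (int i = 0; i < points.length; i++) codes[i] = nearestPrototype(points[i], prototypes);
        return codes;
    }

    public static void main(String[] args) {
        double[][] prototypes = {{0, 0}, {10, 10}};          // hypothetical cluster prototypes
        double[][] points = {{1, 1}, {9, 11}, {0.5, -0.2}};  // hypothetical data objects
        for (int code : encode(points, prototypes)) System.out.print(code + " "); // 0 1 0
    }
}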
Also, while the terms segmentation and partitioning are sometimes used as synonyms for
clustering, these terms are frequently used for approaches outside the traditional bounds of
cluster analysis. For example, the term partitioning is often used in connection with techniques
that divide graphs into sub graphs and that are not strongly connected to clustering.
Segmentation often refers to the division of data into groups using simple techniques; e.g., an
image can be split into segments based only on pixel intensity and color, or people can be
divided into groups based on their income. Nonetheless, some work in graph partitioning and in
image and market segmentation is related to cluster analysis.
Hierarchical versus Partitional: The most commonly discussed distinction among different
types of clustering is whether the set of clusters is nested or unnested, or in more traditional
terminology, hierarchical or partitional. A partitional clustering is simply a division of the set
of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one
subset.
If we permit clusters to have sub clusters, then we obtain a hierarchical clustering, which is a
set of nested clusters that are organized as a tree. Each node (cluster) in the tree (except for the
leaf nodes) is the union of its children (sub clusters), and the root of the tree is the cluster
containing all the objects. Often, but not always, the leaves of the tree are singleton clusters of
individual data objects.
Exclusive versus Overlapping versus Fuzzy: There are many situations in which a point could
reasonably be placed in more than one cluster, and these situations are better addressed by non-
exclusive clustering. In the most general sense, an overlapping or non-exclusive clustering is
used to reflect the fact that an object can simultaneously belong to more than one group (class).
For instance, a person at a university can be both an enrolled student and an employee of the
university. A non-exclusive clustering is also often used when, for example, an object is
“between” two or more clusters and could reasonably be assigned to any of these clusters.
In a fuzzy clustering, every object belongs to every cluster with a membership weight that is
between 0 (absolutely doesn’t belong) and 1 (absolutely belongs). In other words, clusters are
treated as fuzzy sets. (Mathematically, a fuzzy set is one in which an object belongs to any set
with a weight that is between 0 and 1. In fuzzy clustering, we often impose the additional
constraint that the sum of the weights for each object must equal 1.) Similarly, probabilistic
clustering techniques compute the probability with which each point belongs to each cluster, and
these probabilities must also sum to 1. Because the membership weights or probabilities for any
object sum to 1, a fuzzy or probabilistic clustering does not address true multiclass situations,
such as the case of a student employee, where an object belongs to multiple classes.
Instead, these approaches are most appropriate for avoiding the arbitrariness of assigning an
object to only one cluster when it may be close to several. In practice, a fuzzy or probabilistic
clustering is often converted to an exclusive clustering by assigning each object to the cluster in
which its membership weight or probability is highest.
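The conversion described above, from a fuzzy or probabilistic clustering to an exclusive one,
simply assigns each object to the cluster with its largest membership weight. A minimal sketch
with made-up membership weights (each row sums to 1):

public class FuzzyToExclusive {
    // For each object, pick the cluster with the highest membership weight.
    static int[] harden(double[][] memberships) {
        int[] assignment = new int[memberships.length];
        for (int i = 0; i < memberships.length; i++) {
            int best = 0;
            for (int j = 1; j < memberships[i].length; j++) {
                if (memberships[i][j] > memberships[i][best]) best = j;
            }
            assignment[i] = best;
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Hypothetical membership weights for 3 objects over 2 clusters; each row sums to 1.
        double[][] w = {{0.9, 0.1}, {0.4, 0.6}, {0.55, 0.45}};
        for (int c : harden(w)) System.out.print(c + " "); // 0 1 0
    }
}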
Complete versus Partial: A complete clustering assigns every object to a cluster, whereas a
partial clustering does not. The motivation for a partial clustering is that some objects in a data
set may not belong to well-defined groups. Many times objects in the data set may represent
noise, outliers, or “uninteresting background.” For example, some newspaper stories may share a
common theme, such as global warming, while other stories are more generic or one-of-a-kind.
Thus, to find the important topics in last month’s stories, we may want to search only for clusters
of documents that are tightly related by a common theme. In other cases, a complete clustering of
the objects is desired. For example, an application that uses clustering to organize documents for
browsing needs to guarantee that all documents can be browsed.
In density-based clustering, for example, points in low-density regions are classified as noise
and omitted; thus, DBSCAN does not produce a complete clustering.
5.4.1 K-means
Prototype-based clustering techniques create a one-level partitioning of the data objects. There
are a number of such techniques, but two of the most prominent are K-means and K-medoid. K-
means defines a prototype in terms of a centroid, which is usually the mean of a group of points,
and is typically applied to objects in a continuous n-dimensional space.
K-medoid defines a prototype in terms of a medoid, which is the most representative point for a
group of points, and can be applied to a wide range of data since it requires only a proximity
measure for a pair of objects. While a centroid almost never corresponds to an actual data point,
a medoid, by its definition, must be an actual data point. In this section, we will focus solely on
K-means, which is one of the oldest and most widely used clustering algorithms.
K-means is formally described by the basic algorithm below. In the figures displaying K-means
clustering, each subfigure shows (1) the centroids at the start of the iteration and (2) the
assignment of the points to those centroids. The centroids are indicated by the “+” symbol; all
points belonging to the same cluster have the same marker shape.

Basic K-means algorithm:
1: Select K points as initial centroids.
2: repeat
3:    Form K clusters by assigning each point to its closest centroid.
4:    Recompute the centroid of each cluster.
5: until the centroids do not change.
The points are assigned to the initial centroids, which are all in the larger group of points. For
this example, we use the mean as the centroid. After points are assigned to a centroid, the
centroid is then updated. Again, the figure for each step shows the centroid at the beginning of
the step and the assignment of points to those centroids. In the second step, points are assigned to
the updated centroids, and the centroids are updated again. When the K-means algorithm terminates,
because no more changes occur, the centroids have identified the natural groupings of points.
For some combinations of proximity functions and types of centroids, K-means always
converges to a solution; i.e., K-means reaches a state in which no points are shifting from one
cluster to another, and hence, the centroids don’t change. Because most of the convergence
occurs in the early steps, however, the condition on line 5 of the algorithm above is often replaced by a
weaker condition, e.g., repeat until only 1% of the points change clusters.
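The following is a compact, from-scratch Java sketch of the basic K-means loop described above
(choose initial centroids, assign each point to its nearest centroid, recompute centroids, and
stop when assignments no longer change). It is a simplified illustration for small
two-dimensional data, not the implementation used inside WEKA.

import java.util.Arrays;
import java.util.Random;

public class SimpleKMeansSketch {
    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }

    static int[] kMeans(double[][] points, int k, long seed) {
        Random rnd = new Random(seed);
        int dim = points[0].length;
        // 1. Select K points as the initial centroids (simple random choice; may pick duplicates).
        double[][] centroids = new double[k][];
        for (int j = 0; j < k; j++) centroids[j] = points[rnd.nextInt(points.length)].clone();

        int[] assignment = new int[points.length];
        boolean changed = true;
        while (changed) {                      // repeat ... until assignments stop changing
            changed = false;
            // 2. Form K clusters by assigning each point to its closest centroid.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int j = 1; j < k; j++)
                    if (dist2(points[i], centroids[j]) < dist2(points[i], centroids[best])) best = j;
                if (best != assignment[i]) { assignment[i] = best; changed = true; }
            }
            // 3. Recompute the centroid (mean) of each cluster.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < dim; d++) sums[assignment[i]][d] += points[i][d];
            }
            for (int j = 0; j < k; j++)
                if (counts[j] > 0)
                    for (int d = 0; d < dim; d++) centroids[j][d] = sums[j][d] / counts[j];
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] data = {{1, 1}, {1.5, 2}, {1, 0.6}, {8, 8}, {9, 11}, {8, 9}}; // toy data
        System.out.println(Arrays.toString(kMeans(data, 2, 1)));
    }
}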
CHAPTER - 6
WEKA
WEKA provides graphical user interfaces through which all of the processing work is done. The three
graphical user interfaces are:
“The Explorer” (exploratory data analysis)
“The Experimenter” (experimental environment)
“The Knowledge Flow” (new process model inspired interface)
6.3 Visualization
This panel helps us understand the relationships between the different attributes and their values
across the data instances by providing a visual representation in the form of graphs plotted with
different attributes on the X and Y axes.
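Everything that the Explorer does interactively can also be scripted against WEKA's Java API. The
following is a minimal sketch, assuming WEKA is on the classpath and using a hypothetical ARFF
file name, that loads a dataset and builds a J48 decision tree.

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ExplorerEquivalent {
    public static void main(String[] args) throws Exception {
        // Load the dataset (hypothetical file name); WEKA also reads CSV files this way.
        Instances data = DataSource.read("vertebral_column_2C.arff");
        data.setClassIndex(data.numAttributes() - 1);   // last attribute is the class

        J48 tree = new J48();                           // the C4.5-based decision tree learner
        tree.buildClassifier(data);
        System.out.println(tree);                       // prints the induced tree
    }
}

The same pattern works for any other classifier or clusterer bundled with WEKA; only the class
that is instantiated changes.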
CHAPTER – 7
RESULTS & DISCUSSIONS
7.1 OUTCOME
Confusion matrices obtained by applying different classifiers to the Vertebral Column 2C Data Set:

   a    b    <-- classified as
 210    0    a = Abnormal
 100    0    b = Normal

   a    b    <-- classified as
 181   29    a = Abnormal
  32   68    b = Normal

   a    b    <-- classified as
 189   21    a = Abnormal
  36   64    b = Normal

   a    b    <-- classified as
 161   49    a = Abnormal
  32   68    b = Normal

Figure 7.5 Applying Naïve Bayes classification on the Vertebral Column 2C Data Set

   a    b    <-- classified as
 154   56    a = Abnormal
  13   87    b = Normal
Here it is evident that the J48 classifier produces the most accurate result. Hence J48 can be
used in future machine learning work for better and more precise results.
This is another dataset that we are using: the Pre-Poll Survey of the 2012 U.S. Presidential
Election. Here is a snapshot of the Pre-Poll Survey dataset that we have used. It is in
comma-separated values (CSV) format.
Figure 7.8 Attributes and instances of the dataset as shown by the WEKA tool.
Figure 7.8 displays the attributes and the instances that the WEKA tool shows when the dataset is
given as input.
Figure 7.9 The two main classes of the dataset, Democrat and Republican.
In Figure 7.9 the blue column represents the Democrats and the red column represents the Republicans.
Figure 7.10 The water-project-cost-sharing attribute, which shows an equal number of yes and no responses.
Figure 7.11 The adoption-of-the-budget-resolution attribute, which shows that the Democrats won.
Figure 7.12 The anti-satellite-test-ban attribute, which shows that the Democrats won.
Figure 7.13 The education-spending attribute, which shows that the Republicans won.
Figure 7.14 The crime attribute, which shows that the Republicans won.
Figure 7.15 The duty-free-exports attribute, which shows that the Democrats won.
Here, using the Naïve Bayes classifier, we got an accuracy of 90.1149%.
By comparing the various classification algorithms that we applied to the Pre-Poll dataset, we can
see that the J48 classifier gives the most accurate results, with an error of only 3.6782%.
From the confusion matrix we can see that 259 instances are correctly classified as Democrat and 8
are incorrect, while 160 instances are correctly classified as Republican and 8 are incorrect.
While using the various classification algorithms, we used 10-fold cross-validation. Cross-validation
is a standard evaluation technique which divides the dataset into 10 pieces, or folds. It then holds
out each fold in turn for testing while training on the remaining 9 folds. This process gives 10
results, which are then averaged and presented as a single result.
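The 10-fold cross-validation described above can be reproduced programmatically with WEKA's
Evaluation class. The following is a minimal sketch, again assuming a hypothetical ARFF file name
and WEKA on the classpath.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidationSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("prepoll_survey.arff");   // hypothetical file name
        data.setClassIndex(data.numAttributes() - 1);

        Evaluation eval = new Evaluation(data);
        // Hold out each of the 10 folds in turn, train on the other 9, and average the results.
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));

        System.out.println(eval.toSummaryString());   // accuracy, error rate, kappa, ...
        System.out.println(eval.toMatrixString());    // the confusion matrix
    }
}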
Some other measures that we get from the classifier output are:
1. TP Rate - the proportion of instances of a class that are correctly classified as that class.
2. FP Rate - the proportion of instances not belonging to a class that are incorrectly classified
   as that class.
3. Precision - the proportion of the instances classified as a class that actually belong to that class.
4. Recall - the proportion of the instances actually belonging to a class that are classified as
   that class. Recall is equal to the TP Rate.
5. F-Measure - the combined measure of precision and recall, calculated as
   2 * Precision * Recall / (Precision + Recall).
6. Kappa Statistic - a chance-corrected measure of the agreement between the predicted and the true classes.
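These measures can also be computed by hand from the true-positive, false-positive and
false-negative counts of a class. The sketch below uses purely hypothetical counts, not the output
of this project, to show the formulas listed above.

public class ClassMetrics {
    public static void main(String[] args) {
        // Hypothetical counts for one class (not taken from this project's output).
        double tp = 90, fp = 10, fn = 15;

        double precision = tp / (tp + fp);   // of everything classified as the class, how much really is
        double recall = tp / (tp + fn);      // of the actual class members, how many were found (= TP Rate)
        double fMeasure = 2 * precision * recall / (precision + recall);

        System.out.printf("Precision = %.4f%n", precision);
        System.out.printf("Recall    = %.4f%n", recall);
        System.out.printf("F-Measure = %.4f%n", fMeasure);
    }
}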
Cluster 0 has 214 instances, which is 49% of the total available instances.
Cluster 1 has 221 instances, which is 51% of the total available instances.
In the real voting conducted in 2012, the Democrats got 51.1% of the total votes, and our model
likewise assigned 51% of the total instances to one cluster. Our analysis of the Pre-Poll Survey
therefore differed from the actual outcome by only about 0.1%, which is very high accuracy for
such a dataset.
In this graph the blue crosses represent the Democrats and the red crosses represent the Republicans.
Here “y” represents “yes” and “n” represents “no”.
Fig 7.23 Graph with the number of instances on the X-axis and the water-project-cost-sharing attribute on the Y-axis.
We can see that there are almost equal numbers of red and blue crosses in both the “y” and “n”
parts of the Y-axis; therefore the Democrats and the Republicans got an almost equal number of votes.
Fig 7.24 Graph with the number of instances on the X-axis and the adoption-of-the-budget-resolution attribute on the Y-axis.
We can see that there are more blue crosses in the “y” part of the Y-axis; therefore the Democrats got more votes.
Fig 7.25 Graph with the number of instances on the X-axis and the anti-satellite-test-ban attribute on the Y-axis.
We can see that there are more blue crosses in the “y” part of the Y-axis; therefore the Democrats got more votes.
Fig 7.26 Graph with the number of instances on the X-axis and the education-spending attribute on the Y-axis.
We can see that there are more red crosses in the “y” part of the Y-axis; therefore the Republicans got more votes.
Fig 7.27 Graph with the number of instances on the X-axis and the crime attribute on the Y-axis.
We can see that there are more red crosses in the “y” part of the Y-axis; therefore the Republicans got more votes.
Fig 7.28 Graph with the number of instances on the X-axis and the duty-free-exports attribute on the Y-axis.
We can see that there are more blue crosses in the “y” part of the Y-axis; therefore the Democrats got more votes.
7.2 FUTURE WORK
● Application of more efficient Machine Learning algorithms (such as Support Vector Machines).
● Better analysis of the data and visualization of the analyzed results in the form of more graphs.
● Use of the experience and knowledge gained to apply machine learning in Artificial Neural
  Networks (ANN) for better predictive analysis.
7.3 SUMMARY
In this project we studied Machine Learning and learned to apply various methods and to understand
their results and accuracy. For supervised machine learning we applied classification, and for
unsupervised learning we applied clustering.
We also learned about the software WEKA and its different components which were used to
effectively study, analyze, visualize and fully understand different datasets. We mainly used the
Explorer component of WEKA to apply different classification and clustering algorithms.
For our project we have used medical datasets. All the datasets used in this project were taken
from the UCI repository, and they include:
Vertebral Column 2C Data Set.
Vertebral Column 3C Data Set.
Mammographic-Masses Data Set.
Cryotherapy Data Set.
Haberman’s Survival Data Set.
2012 U.S. Presidential Election Pre-Poll Survey Data Set.
Similar work has been done on 15 more datasets from the same repository.
For classification we applied different models such as J48, REPTree, ZeroR, OneR and Naïve Bayes.
We observed that out of all classification models the J48 decision tree had the best prediction
accuracy. For clustering we used the simple K-means algorithm on the datasets.
7.4 CONCLUSION
After comparing the results of the different models, we can conclude that for the medical datasets
used, the J48 model produced the most accurate classification results, and for clustering, simple
K-means produced the most optimal result.
WEKA provided us with a feasible means to effectively study and understand large datasets without
any explicit programming or extensive assistance, which allowed us to understand the essential
principles and working of Machine Learning. Applying the knowledge and experience that we gained
by doing this project requires further study and research, which can be done by using Artificial
Neural Networks.
7.5 REFERENCES